Performance Analysis Tutorial
How do I Measure "Performance"?
The performance of a network system is measured by two primary characteristics:
- Latency
- The time (delay) it takes for a message to be transported across the network
- Throughput
- The rate at which the message is transmitted across the network
For instance, a round-the-world fiber link can have very high bandwidth (as measured in bytes/sec), but the time required for the optical signal to propagate through the cable is also high (hundreds of milliseconds).
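As a rough sanity check of that claim, here is a back-of-the-envelope sketch using shell arithmetic (the 40,000 km path length and 200,000 km/s propagation speed of light in fiber are illustrative round numbers, not measurements):

# One-way propagation delay in milliseconds: distance / speed, scaled to ms
$ echo $(( 40000 * 1000 / 200000 ))
200

No amount of additional bandwidth reduces that ~200 ms one-way (~400 ms round-trip) propagation delay.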
Questions:
- Why is latency a concern? Isn't low bandwidth what is slowing my downloads?
- What applications are particularly latency-sensitive?
Further Reading (Optional):
- It's the Latency, Stupid (a few years old, but still relevant today)
What Affects Network System Performance?
The performance of a network system (i.e. sending a message from computer A to computer B across a network) can be influenced by many different subsystems, including:
- Host system (CPU/Memory)
- Does the host system have sufficient processor and memory resources to support the desired application and operating system network stack?
- Host system (Interconnect)
- Does the interconnect between the host system and NIC (PCI, PCI-X, PCI Express) have sufficient bandwidth?
- What latency does the interconnect add?
- NIC
- Does the network interface card have sufficient resources to transmit or receive all packets requested of it?
- What latency does the NIC add?
- Network
- What is the raw channel capacity (bandwidth) of the network "wires"?
- Do all network devices (routers/switches/etc...) have sufficient bandwidth? Does this bandwidth vary based on the type of packet being transmitted?
- What is the raw latency of the network "wires"?
- How much latency do the routers/switches add because of processing/queueing? Does this latency vary based on the type of packet being transmitted?
- Protocols
- UDP transmits packets at the maximum rate the host can generate, regardless of any bottlenecks or packet loss downstream
- TCP attempts to regulate its transmission rate to avoid packet loss (or corruption) downstream
- Additive-Increase, Multiplicative-Decrease - TCP starts transmitting at a low rate and increases it linearly (additive increase). At some point, however, the network system becomes saturated and packet loss occurs. Upon detecting packet loss, TCP cuts its transmission rate in half (multiplicative decrease), retransmits the lost data, and begins increasing its transmission speed again. This creates a "sawtooth" effect when plotting the achieved bandwidth over time.
- Bandwidth-Delay Product - The product of a network system's bandwidth (bits/sec) and latency (sec). This metric describes the amount of data "in transit" in the network, and can be quite large on high-bandwidth, high-latency systems such as satellite networks. The bandwidth-delay product determines how much buffering is required on the transmitting system in order to retransmit any packets that are lost in transit (see the worked calculation after this list).
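As a concrete illustration of the bandwidth-delay product, consider a hypothetical 1 Gbit/sec link with a 100 ms round-trip time (both figures are made up for the example):

# BDP = bandwidth (bytes/sec) x delay (sec)
# 1 Gbit/sec = 1,000,000,000 / 8 bytes/sec; 100 ms = 100/1000 sec
$ echo $(( 1000000000 / 8 * 100 / 1000 ))
12500000

Roughly 12.5 MB of data can be "in flight" at once on such a link, so the sender needs on the order of that much buffering (and comparably sized TCP windows / socket buffers) to keep the pipe full.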
Note that the performance constraints described above may not be symmetric. For instance, it is almost always more efficient for a computer's network stack and device driver to transmit a series of packets than to receive them, which allows for greater achievable transmit bandwidth. Also, packets may be routed asymmetrically across the Internet, taking a different path from A->B than from B->A.
Performance Analysis Tools
The following tools all measure different aspects of performance, and together can form a useful (but incomplete) view of the system.
top
top provides a real-time view of a computer system and its active processes. This application can be used to determine whether the CPU is saturated or memory resources are exhausted, which would make it likely that the host computer is the cause of a network performance bottleneck.
Options:
- Press "1" after running to toggle between an averaged CPU view and per-CPU statistics. The per-CPU statistics are especially helpful. Imagine an 8-CPU system running a single-threaded network application. That network application could be consuming 100% of 1 CPU and thus be the performance bottleneck, but the (default) average CPU metric will report a very-misleading 87.5% idle.
- Press "q" to quit the program
Example:
shafer@comp519:~$ top

top - 21:18:50 up 56 days, 4:57, 12 users,  load average: 1.00, 0.97, 0.74
Tasks: 249 total,   2 running, 244 sleeping,   2 stopped,   1 zombie
Cpu0  :  1.7%us,  0.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.5%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  1.6%us,  0.0%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  1.0%us,  0.0%sy,  0.0%ni, 98.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  1.0%us,  0.1%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  1.5%us,  0.0%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  1.6%us,  0.0%sy,  0.0%ni, 98.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8187372k total,  6018464k used,  2168908k free,   532236k buffers
Swap: 31246344k total,   513256k used, 30733088k free,  3545432k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2690 bes4623   25   0 1353m 1.2g  29m R  100 15.2   2:59.22 par
 2748 shafer    15   0 19084 1444  972 R    1  0.0   0:00.01 top
    1 root      18   0  3960  264  184 S    0  0.0   0:07.96 init
    2 root      10  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.60 migration/0
    4 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    5 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT  -5     0    0    0 S    0  0.0   0:00.03 migration/1
    7 root      34  19     0    0    0 S    0  0.0   0:00.03 ksoftirqd/1
    8 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      RT  -5     0    0    0 S    0  0.0   0:00.33 migration/2
   10 root      34  19     0    0    0 S    0  0.0   0:00.01 ksoftirqd/2
   11 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root      RT  -5     0    0    0 S    0  0.0   0:05.24 migration/3
   13 root      34  19     0    0    0 S    0  0.0   0:00.06 ksoftirqd/3
   14 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root      RT  -5     0    0    0 S    0  0.0   0:00.39 migration/4
   16 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/4
   17 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/4
   18 root      RT  -5     0    0    0 S    0  0.0   0:00.05 migration/5
   19 root      34  19     0    0    0 S    0  0.0   0:00.04 ksoftirqd/5
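top can also record statistics non-interactively while a network test runs, which makes it easier to correlate CPU load with benchmark results afterwards. A minimal sketch (the 2-second interval, 10-sample count, and output filename are arbitrary choices):

# Record 10 snapshots, 2 seconds apart, for later inspection
$ top -b -d 2 -n 10 > cpu_during_test.log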
Questions:
- Will top provide useful information on every bottleneck that might affect the host computer system? What is missing?
- If you had a hypothesis that the host computer (CPU, memory, ...) was the bottleneck, what kind of experiment could you conduct to confirm that?
Ping
Ping sends a 56-byte ICMP echo request packet across the network, and the receiving host sends back a 56-byte ICMP echo reply. The round-trip time (i.e. latency) to complete this process is measured and displayed for each ping, and statistics (minimum, maximum, average, and mean deviation) are calculated over all ping packets sent.
shafer@comp519:~$ ping -c 5 rice.edu
PING rice.edu (128.42.5.4) 56(84) bytes of data.
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=1 ttl=125 time=1.25 ms
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=2 ttl=125 time=1.06 ms
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=3 ttl=125 time=1.31 ms
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=4 ttl=125 time=1.07 ms
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=5 ttl=125 time=1.16 ms

--- rice.edu ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4000ms
rtt min/avg/max/mdev = 1.067/1.174/1.313/0.100 ms
Examples
Pinging NetFPGA router IP:
shafer@nf-server1:~$ ping 10.143.206.65
PING 10.143.206.65 (10.143.206.65) 56(84) bytes of data.
64 bytes from 10.143.206.65: icmp_seq=1 ttl=64 time=0.973 ms
64 bytes from 10.143.206.65: icmp_seq=2 ttl=64 time=0.208 ms
64 bytes from 10.143.206.65: icmp_seq=3 ttl=64 time=0.215 ms
64 bytes from 10.143.206.65: icmp_seq=4 ttl=64 time=0.216 ms
64 bytes from 10.143.206.65: icmp_seq=5 ttl=64 time=0.207 ms

--- 10.143.206.65 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5002ms
rtt min/avg/max/mdev = 0.201/0.336/0.973/0.285 ms
Pinging through NetFPGA router to Server2:
shafer@nf-server1:~$ ping 10.143.206.130
PING 10.143.206.130 (10.143.206.130) 56(84) bytes of data.
64 bytes from 10.143.206.130: icmp_seq=1 ttl=63 time=0.097 ms
64 bytes from 10.143.206.130: icmp_seq=2 ttl=63 time=0.095 ms
64 bytes from 10.143.206.130: icmp_seq=3 ttl=63 time=0.091 ms
64 bytes from 10.143.206.130: icmp_seq=4 ttl=63 time=0.091 ms
64 bytes from 10.143.206.130: icmp_seq=5 ttl=63 time=0.092 ms
64 bytes from 10.143.206.130: icmp_seq=6 ttl=63 time=0.091 ms

--- 10.143.206.130 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4997ms
rtt min/avg/max/mdev = 0.091/0.092/0.097/0.012 ms
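The payload size can also be increased from the default 56 bytes with ping's -s option, which gives a rough feel for how much of the round-trip time is fixed per-packet overhead versus per-byte (serialization) cost. A sketch (the 1400-byte payload is an arbitrary choice that still fits within a standard 1500-byte Ethernet MTU):

$ ping -c 5 -s 1400 10.143.206.130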
Questions:
- What network systems are included in the round-trip time that ping measures?
- Ping application processing?
- OS network stack processing?
- OS / driver processing?
- NIC processing?
- Router processing?
- Switch queueing?
- Can ping be used to measure bandwidth?
- Why is pinging the NetFPGA router slower than pinging a server on the other side of the router?
netstat
netstat can be used to display statistics from the Linux network stack. Of particular interest are statistics regarding TCP packet errors due to packets being lost or corrupted. These errors are not visible to the end-user, because TCP provides the abstraction of a reliable network. Behind the scenes, however, every lost packet must be retransmitted, greatly increasing network latency. In addition, when TCP encounters data loss (even a single packet!), it assumes that the network is congested and throttles its bandwidth usage accordingly, further degrading network performance.
Example:
shafer@comp519:~$ netstat -s
Ip:
    313592135 total packets received
    18 with invalid headers
    14 with invalid addresses
    1133352 forwarded
    0 incoming packets discarded
    310573355 incoming packets delivered
    315803348 requests sent out
    8 dropped because of missing route
    2239242 reassemblies required
    1119520 packets reassembled ok
    1 packet reassembles failed
Icmp:
    85131 ICMP messages received
    325 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 1095
        timeout in transit: 6076
        echo requests: 2125
        echo replies: 75587
    1485099 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1482962
        time exceeded: 12
        echo replies: 2125
Tcp:
    34964 active connections openings
    181752 passive connection openings
    1280 failed connection attempts
    1779 connection resets received
    21 connections established
    305044644 segments received
    310870496 segments send out
    86523 segments retransmited
    1 bad segments received.
    2327 resets sent
Udp:
    3683271 packets received
    1482715 packets to unknown port received.
    11 packet receive errors
    3545516 packets sent
UdpLite:
TcpExt:
    54 resets received for embryonic SYN_RECV sockets
    3491 packets pruned from receive queue because of socket buffer overrun
    3 ICMP packets dropped because they were out-of-window
    43350 TCP sockets finished time wait in fast timer
    24 time wait sockets recycled by time stamp
    49 packets rejects in established connections because of timestamp
    6586019 delayed acks sent
    18297 delayed acks further delayed because of locked socket
    Quick ack mode was activated 15469 times
    3498683 packets directly queued to recvmsg prequeue.
    2425210898 of bytes directly received from backlog
    1811226734 of bytes directly received from prequeue
    154083839 packet headers predicted
    571946 packets header predicted and directly queued to user
    14241747 acknowledgments not containing data received
    112224177 predicted acknowledgments
    23 times recovered from packet loss due to fast retransmit
    8618 times recovered from packet loss due to SACK data
    1 bad SACKs received
    Detected reordering 23 times using FACK
    Detected reordering 32 times using SACK
    Detected reordering 4 times using reno fast retransmit
    Detected reordering 657 times using time stamp
    650 congestion windows fully recovered
    1974 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 161
    2337 congestion windows recovered after partial ack
    5794 TCP data loss events
    2 timeouts after reno fast retransmit
    4343 timeouts after SACK recovery
    124 timeouts in loss state
    10085 fast retransmits
    2141 forward retransmits
    4473 retransmits in slow start
    59535 other TCP timeouts
    TCPRenoRecoveryFail: 1
    163 sack retransmits failed
    30 times receiver scheduled too late for direct processing
    119701 packets collapsed in receive queue due to low socket buffer
    20275 DSACKs sent for old packets
    11 DSACKs sent for out of order packets
    63582 DSACKs received
    331 connections reset due to unexpected data
    693 connections reset due to early user close
    291 connections aborted due to timeout
IpExt:
    InMcastPkts: 356399
    OutMcastPkts: 56443
    InBcastPkts: 111965
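When hunting for loss during a live test, it can help to filter the output down to the loss-related counters and sample them periodically. A minimal sketch (the one-second interval and the grep pattern are just one reasonable choice):

# Print the retransmission / loss counters once per second
$ watch -n 1 "netstat -s | grep -i -E 'retrans|loss'"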
Questions:
- If netstat is reporting TCP data loss events, how can we determine where packets might be corrupted or dropped?
- Where else might we get information on lost or corrupted packets?
netperf
netperf (http://www.netperf.org/) is a network benchmarking tool that can be used to perform a variety of tests:
- TCP and UDP unidirectional streaming bandwidth test using standard Sockets interface
- TCP and UDP request/response latency test using standard Sockets interface
netperf is split into two pieces: a client application and a server application. It streams data between the two applications across the network, and coordinates the test via an independent control connection.
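Before the client can run a test, the server piece (the netserver daemon that ships with netperf) must be listening on the remote machine. A typical setup might look like the following (the IP address is the lab server used in the examples below):

# On the remote host: start the netperf server daemon (listens on port 12865 by default)
$ netserver

# On the local host: run a basic 10-second TCP streaming test against it
$ netperf -H 10.143.206.130 -t TCP_STREAM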
Options:
- -H - Specify the host (netperf server) to connect to for testing
- -l - Specify duration (length) of test to be performed
- -t - Specify the protocol used:
- TCP_STREAM - High-bandwidth TCP streaming test from client to server. Result is a bandwidth measurement (Mbit/sec)
- TCP_MAERTS - High-bandwidth TCP streaming test from server to client (MAERTS is STREAM spelled backwards)
- UDP_STREAM - High-bandwidth UDP streaming test from client to server
- UDP_MAERTS - High-bandwidth UDP streaming test from server to client
- TCP_RR - Request/response test - TCP "ping" from user-space client to user-space server, and back again. The result is not the elapsed time per "ping", but rather the average number of "pings" completed per second.
- UDP_RR - Request/response test - UDP "ping" from user-space client to user-space server, and back again
Examples:
TCP stream from server1 to server2 via NetFPGA router:
shafer@nf-server1:~$ netperf -H 10.143.206.130 -t TCP_STREAM
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.143.206.130 (10.143.206.130) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.03     940.87
TCP request-response test from server1 to server2 via NetFPGA router:
shafer@nf-server1:~$ netperf -H 10.143.206.130 -t TCP_RR
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.143.206.130 (10.143.206.130) port 0 AF_INET
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.02    10718.38
16384  87380
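netperf also accepts test-specific options after a "--" separator; for example, -m sets the send message size. A sketch of a UDP streaming test using 1472-byte messages (chosen so that each message fits in one standard 1500-byte Ethernet frame after the IP and UDP headers are added):

$ netperf -H 10.143.206.130 -t UDP_STREAM -l 10 -- -m 1472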
Questions:
- Why is it a bad idea to use netperf (or any other TCP streaming test) on the public internet, or public Pacific network?
- What happens if all the groups do a netperf TCP stream test from server1 to server2 at the same time (each via their own router)? What possible bottlenecks might emerge?
PCI Bus Analyzer
Imagine a product that works like Wireshark, but instead of capturing and analyzing traffic over your network, it captures and analyzes traffic over your PCI bus (or PCI-X, PCI Express, etc...). For the everyday low price of tens of thousands of dollars, such a tool can be yours!
Capabilities:
- Capture all bus traffic for analysis (capture length dependent on device memory), allowing you to view individual packets and control data moving across the bus (perfect for analyzing how the driver and NIC communicate!)
- Capture based on a trigger such as a bus read or write to a particular address
- Extensive visualization tools to view overall bus utilization and break down by traffic type
- Detection of invalid bus transfers due to protocol errors
- Initiate transfers on the PCI bus (i.e. by pretending that you are the host or a peripheral device)
NetFPGA Questions
- How could you modify / extend your own router design to gain additional information on potential router performance bottlenecks?
- How could you construct a test that would explore the performance limitations of your software (and the PCI interconnect between the NetFPGA board and the host system)?