Performance Analysis Tutorial
How do I Measure "Performance"?
The performance of a network system is measured by two primary characteristics:
- Latency
- The time (delay) it takes for a message to be transported across the network
- Throughput
- The rate at which the message is transmitted across the network
For instance, a round-the-world fiber link can have very high bandwidth (as measured in bytes/sec), but the time required for the optical signal to propagate through the cable is also high (hundreds of milliseconds).
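As a rough sanity check of that claim, here is a back-of-the-envelope sketch using shell arithmetic (the 40,000 km path length and 200,000 km/s propagation speed of light in fiber are illustrative round numbers, not measurements):

# One-way propagation delay in milliseconds: distance / speed, scaled to ms
$ echo $(( 40000 * 1000 / 200000 ))
200

No amount of additional bandwidth reduces that ~200 ms one-way (~400 ms round-trip) propagation delay.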
Questions:
- Why is latency a concern? Isn't low bandwidth what is slowing my downloads?
- What applications are particularly latency-sensitive?
Further Reading (Optional):
- It's the Latency, Stupid (a few years old, but still relevant today)
What Affects Network System Performance?
The performance of a network system (i.e. sending a message from computer A to computer B across a network) can be influenced by many different subsystems, including:
- Host system (CPU/Memory)
- Does the host system have sufficient processor and memory resources to support the desired application and operating system network stack?
- Host system (Interconnect)
- Does the interconnect between the host system and NIC (PCI, PCI-X, PCI Express) have sufficient bandwidth?
- What latency does the interconnect add?
- NIC
- Does the network interface card have sufficient resources to transmit or receive all packets requested of it?
- What latency does the NIC add?
- Network
- What is the raw channel capacity (bandwidth) of the network "wires"?
- Do all network devices (routers/switches/etc...) have sufficient bandwidth? Does this bandwidth vary based on the type of packet being transmitted?
- What is the raw latency of the network "wires"?
- How much latency do the routers/switches add because of processing/queueing? Does this latency vary based on the type of packet being transmitted?
- Protocols
- UDP transmits packets at the maximum rate the host can generate, regardless of any bottlenecks or packet loss downstream
- TCP attempts to regulate its transmission rate to avoid packet loss (or corruption) downstream
- Additive-Increase, Multiplicative-Decrease - TCP starts transmitting at a low rate and increases it linearly (additive increase). At some point, however, the network system becomes saturated and packet loss occurs. Upon detecting packet loss, TCP cuts its transmission rate in half (multiplicative decrease), retransmits the lost data, and begins increasing its transmission speed again. This creates a "sawtooth" effect when plotting the achieved bandwidth over time.
- Bandwidth-Delay Product - The product of a network system's bandwidth (bits/sec) and latency (sec). This metric describes the amount of data "in transit" in the network, and can be quite large on high-bandwidth, high-latency systems such as satellite networks. The bandwidth-delay product determines how much buffering is required on the transmitting system in order to retransmit any packets that are lost in transit (see the worked calculation after this list).
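As a concrete illustration of the bandwidth-delay product, consider a hypothetical 1 Gbit/sec link with a 100 ms round-trip time (both figures are made up for the example):

# BDP = bandwidth (bytes/sec) x delay (sec)
# 1 Gbit/sec = 1,000,000,000 / 8 bytes/sec; 100 ms = 100/1000 sec
$ echo $(( 1000000000 / 8 * 100 / 1000 ))
12500000

Roughly 12.5 MB of data can be "in flight" at once on such a link, so the sender needs on the order of that much buffering (and comparably sized TCP windows / socket buffers) to keep the pipe full.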
Note that the performance constraints described above may not be symmetric. For instance, it is almost always more efficient for a computer's network stack and device driver to transmit a series of packets than to receive them, which allows for greater achievable transmit bandwidth. Also, packets may be routed asymmetrically across the Internet, taking a different path from A->B than from B->A.
Performance Analysis Tools
The following tools all measure different aspects of performance, and together can form a useful (but incomplete) view of the system.
top
top provides a real-time view of a computer system and its active processes. This application can be used to determine whether the CPU is saturated or memory resources are exhausted, which would make it likely that the host computer is the cause of a network performance bottleneck.
Options:
- Press "1" after running to toggle between an averaged CPU view and per-CPU statistics. The per-CPU statistics are especially helpful. Imagine an 8-CPU system running a single-threaded network application. That network application could be consuming 100% of 1 CPU and thus be the performance bottleneck, but the (default) average CPU metric will report a very-misleading 87.5% idle.
- Press "q" to quit the program
Example:
shafer@comp519:~$ top

top - 21:18:50 up 56 days, 4:57, 12 users,  load average: 1.00, 0.97, 0.74
Tasks: 249 total,   2 running, 244 sleeping,   2 stopped,   1 zombie
Cpu0  :  1.7%us,  0.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.5%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  1.6%us,  0.0%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  1.0%us,  0.0%sy,  0.0%ni, 98.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  1.0%us,  0.1%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  1.5%us,  0.0%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  1.6%us,  0.0%sy,  0.0%ni, 98.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8187372k total,  6018464k used,  2168908k free,   532236k buffers
Swap: 31246344k total,   513256k used, 30733088k free,  3545432k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2690 bes4623   25   0 1353m 1.2g  29m R  100 15.2   2:59.22 par
 2748 shafer    15   0 19084 1444  972 R    1  0.0   0:00.01 top
    1 root      18   0  3960  264  184 S    0  0.0   0:07.96 init
    2 root      10  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.60 migration/0
    4 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    5 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT  -5     0    0    0 S    0  0.0   0:00.03 migration/1
    7 root      34  19     0    0    0 S    0  0.0   0:00.03 ksoftirqd/1
    8 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      RT  -5     0    0    0 S    0  0.0   0:00.33 migration/2
   10 root      34  19     0    0    0 S    0  0.0   0:00.01 ksoftirqd/2
   11 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root      RT  -5     0    0    0 S    0  0.0   0:05.24 migration/3
   13 root      34  19     0    0    0 S    0  0.0   0:00.06 ksoftirqd/3
   14 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root      RT  -5     0    0    0 S    0  0.0   0:00.39 migration/4
   16 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/4
   17 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/4
   18 root      RT  -5     0    0    0 S    0  0.0   0:00.05 migration/5
   19 root      34  19     0    0    0 S    0  0.0   0:00.04 ksoftirqd/5
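top can also record statistics non-interactively while a network test runs, which makes it easier to correlate CPU load with benchmark results afterwards. A minimal sketch (the 2-second interval, 10-sample count, and output filename are arbitrary choices):

# Record 10 snapshots, 2 seconds apart, for later inspection
$ top -b -d 2 -n 10 > cpu_during_test.log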
Questions:
- Will top provide useful information on every bottleneck that might affect the host computer system? What is missing?
- If you had a hypothesis that the host computer (CPU, memory, ...) was the bottleneck, what kind of experiment could you conduct to confirm that?
Ping
Ping sends a 56-byte ICMP echo request packet across the network, and the receiving host sends back a 56-byte ICMP echo reply. The round-trip time (i.e. latency) to complete this process is measured and displayed for each ping, and statistics (minimum, maximum, average, and mean deviation) are calculated over all ping packets sent.
shafer@comp519:~$ ping -c 5 rice.edu
PING rice.edu (128.42.5.4) 56(84) bytes of data.
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=1 ttl=125 time=1.25 ms
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=2 ttl=125 time=1.06 ms
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=3 ttl=125 time=1.31 ms
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=4 ttl=125 time=1.07 ms
64 bytes from moe.rice.edu (128.42.5.4): icmp_seq=5 ttl=125 time=1.16 ms

--- rice.edu ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4000ms
rtt min/avg/max/mdev = 1.067/1.174/1.313/0.100 ms
Examples
Pinging NetFPGA router IP:
shafer@nf-server1:~$ ping 10.143.206.65
PING 10.143.206.65 (10.143.206.65) 56(84) bytes of data.
64 bytes from 10.143.206.65: icmp_seq=1 ttl=64 time=0.973 ms
64 bytes from 10.143.206.65: icmp_seq=2 ttl=64 time=0.208 ms
64 bytes from 10.143.206.65: icmp_seq=3 ttl=64 time=0.215 ms
64 bytes from 10.143.206.65: icmp_seq=4 ttl=64 time=0.216 ms
64 bytes from 10.143.206.65: icmp_seq=5 ttl=64 time=0.207 ms

--- 10.143.206.65 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5002ms
rtt min/avg/max/mdev = 0.201/0.336/0.973/0.285 ms
Pinging through NetFPGA router to Server2:
shafer@nf-server1:~$ ping 10.143.206.130
PING 10.143.206.130 (10.143.206.130) 56(84) bytes of data.
64 bytes from 10.143.206.130: icmp_seq=1 ttl=63 time=0.097 ms
64 bytes from 10.143.206.130: icmp_seq=2 ttl=63 time=0.095 ms
64 bytes from 10.143.206.130: icmp_seq=3 ttl=63 time=0.091 ms
64 bytes from 10.143.206.130: icmp_seq=4 ttl=63 time=0.091 ms
64 bytes from 10.143.206.130: icmp_seq=5 ttl=63 time=0.092 ms
64 bytes from 10.143.206.130: icmp_seq=6 ttl=63 time=0.091 ms

--- 10.143.206.130 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4997ms
rtt min/avg/max/mdev = 0.091/0.092/0.097/0.012 ms
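The payload size can also be increased from the default 56 bytes with ping's -s option, which gives a rough feel for how much of the round-trip time is fixed per-packet overhead versus per-byte (serialization) cost. A sketch (the 1400-byte payload is an arbitrary choice that still fits within a standard 1500-byte Ethernet MTU):

$ ping -c 5 -s 1400 10.143.206.130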
Questions:
- What network systems are included in the round-trip time that ping measures?
- Ping application processing?
- OS network stack processing?
- OS / driver processing?
- NIC processing?
- Router processing?
- Switch queueing?
- Can ping be used to measure bandwidth?
- Why is pinging the NetFPGA router slower than pinging a server on the other side of the router?
netstat
netstat can be used to display statistics from the Linux network stack. Of particular interest are statistics regarding TCP packet errors due to packets being lost or corrupted. These errors are not visible to the end-user, because TCP provides the abstraction of a reliable network. Behind the scenes, however, every lost packet must be retransmitted, greatly increasing network latency. In addition, when TCP encounters data loss (even a single packet!), it assumes that the network is congested and throttles its bandwidth usage accordingly, further degrading network performance.
Example:
shafer@comp519:~$ netstat -s
Ip:
    313592135 total packets received
    18 with invalid headers
    14 with invalid addresses
    1133352 forwarded
    0 incoming packets discarded
    310573355 incoming packets delivered
    315803348 requests sent out
    8 dropped because of missing route
    2239242 reassemblies required
    1119520 packets reassembled ok
    1 packet reassembles failed
Icmp:
    85131 ICMP messages received
    325 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 1095
        timeout in transit: 6076
        echo requests: 2125
        echo replies: 75587
    1485099 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1482962
        time exceeded: 12
        echo replies: 2125
Tcp:
    34964 active connections openings
    181752 passive connection openings
    1280 failed connection attempts
    1779 connection resets received
    21 connections established
    305044644 segments received
    310870496 segments send out
    86523 segments retransmited
    1 bad segments received.
    2327 resets sent
Udp:
    3683271 packets received
    1482715 packets to unknown port received.
    11 packet receive errors
    3545516 packets sent
UdpLite:
TcpExt:
    54 resets received for embryonic SYN_RECV sockets
    3491 packets pruned from receive queue because of socket buffer overrun
    3 ICMP packets dropped because they were out-of-window
    43350 TCP sockets finished time wait in fast timer
    24 time wait sockets recycled by time stamp
    49 packets rejects in established connections because of timestamp
    6586019 delayed acks sent
    18297 delayed acks further delayed because of locked socket
    Quick ack mode was activated 15469 times
    3498683 packets directly queued to recvmsg prequeue.
    2425210898 of bytes directly received from backlog
    1811226734 of bytes directly received from prequeue
    154083839 packet headers predicted
    571946 packets header predicted and directly queued to user
    14241747 acknowledgments not containing data received
    112224177 predicted acknowledgments
    23 times recovered from packet loss due to fast retransmit
    8618 times recovered from packet loss due to SACK data
    1 bad SACKs received
    Detected reordering 23 times using FACK
    Detected reordering 32 times using SACK
    Detected reordering 4 times using reno fast retransmit
    Detected reordering 657 times using time stamp
    650 congestion windows fully recovered
    1974 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 161
    2337 congestion windows recovered after partial ack
    5794 TCP data loss events
    2 timeouts after reno fast retransmit
    4343 timeouts after SACK recovery
    124 timeouts in loss state
    10085 fast retransmits
    2141 forward retransmits
    4473 retransmits in slow start
    59535 other TCP timeouts
    TCPRenoRecoveryFail: 1
    163 sack retransmits failed
    30 times receiver scheduled too late for direct processing
    119701 packets collapsed in receive queue due to low socket buffer
    20275 DSACKs sent for old packets
    11 DSACKs sent for out of order packets
    63582 DSACKs received
    331 connections reset due to unexpected data
    693 connections reset due to early user close
    291 connections aborted due to timeout
IpExt:
    InMcastPkts: 356399
    OutMcastPkts: 56443
    InBcastPkts: 111965
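When hunting for loss during a live test, it can help to filter the output down to the loss-related counters and sample them periodically. A minimal sketch (the one-second interval and the grep pattern are just one reasonable choice):

# Print the retransmission / loss counters once per second
$ watch -n 1 "netstat -s | grep -i -E 'retrans|loss'"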
Questions:
- If netstat is reporting TCP data loss events, how can we determine where packets might be corrupted or dropped?
- Where else might we get information on lost or corrupted packets?
netperf
netperf (http://www.netperf.org/) is a network benchmarking tool that can be used to perform a variety of tests:
- TCP and UDP unidirectional streaming bandwidth test using standard Sockets interface
- TCP and UDP request/response latency test using standard Sockets interface
netperf is split into two pieces: a client application and a server application. It streams data between the two applications across the network, and coordinates the test via an independent control connection.
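Before the client can run a test, the server piece (the netserver daemon that ships with netperf) must be listening on the remote machine. A typical setup might look like the following (the IP address is the lab server used in the examples below):

# On the remote host: start the netperf server daemon (listens on port 12865 by default)
$ netserver

# On the local host: run a basic 10-second TCP streaming test against it
$ netperf -H 10.143.206.130 -t TCP_STREAM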
Options:
- -H - Specify the host (netperf server) to connect to for testing
- -l - Specify duration (length) of test to be performed
- -t - Specify the protocol used:
- TCP_STREAM - High-bandwidth TCP streaming test from client to server. Result is a bandwidth measurement (Mbit/sec)
- TCP_MAERTS - High-bandwidth TCP streaming test from server to client (MAERTS is STREAM spelled backwards)
- UDP_STREAM - High-bandwidth UDP streaming test from client to server
- UDP_MAERTS - High-bandwidth UDP streaming test from server to client
- TCP_RR - Request/response test - TCP "ping" from user-space client to user-space server, and back again. The result is not the elapsed time per "ping", but rather the average number of "pings" completed per second.
- UDP_RR - Request/response test - UDP "ping" from user-space client to user-space server, and back again
Examples:
TCP stream from server1 to server2 via NetFPGA router:
shafer@nf-server1:~$ netperf -H 10.143.206.130 -t TCP_STREAM
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.143.206.130 (10.143.206.130) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.03     940.87
TCP request-response test from server1 to server2 via NetFPGA router:
shafer@nf-server1:~$ netperf -H 10.143.206.130 -t TCP_RR
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.143.206.130 (10.143.206.130) port 0 AF_INET
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.02    10718.38
16384  87380
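netperf also accepts test-specific options after a "--" separator; for example, -m sets the send message size. A sketch of a UDP streaming test using 1472-byte messages (chosen so that each message fits in one standard 1500-byte Ethernet frame after the IP and UDP headers are added):

$ netperf -H 10.143.206.130 -t UDP_STREAM -l 10 -- -m 1472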
Questions:
- Why is it a bad idea to use netperf (or any other TCP streaming test) on the public internet, or public Pacific network?
- What happens if all the groups do a netperf TCP stream test from server1 to server2 at the same time (each via their own router)? What possible bottlenecks might emerge?
PCI Bus Analyzer
Imagine a product that works like Wireshark, but instead of capturing and analyzing traffic over your network, it captures and analyzes traffic over your PCI bus (or PCI-X, PCI Express, etc...). For the everyday low price of tens of thousands of dollars, such a tool can be yours!
Capabilities:
- Capture all bus traffic for analysis (capture length dependent on device memory), allowing you to view individual packets and control data moving across the bus (perfect for analyzing how the driver and NIC communicate!)
- Capture based on a trigger such as a bus read or write to a particular address
- Extensive visualization tools to view overall bus utilization and break down by traffic type
- Detection of invalid bus transfers due to protocol errors
- Initiate transfers on the PCI bus (i.e. by pretending that you are the host or a peripheral device)
NetFPGA Questions
- How could you modify / extend your own router design to gain additional information on potential router performance bottlenecks?
- How could you construct a test that would explore the performance limitations of your software (and the PCI interconnect between the NetFPGA board and the host system)?