Measuring Driver Performance in Perf

9 August 2010 by Pingu

A couple of weeks ago Linux kernel 2.6.35 was officially released. For me, this release hasn’t been as exciting as, say, 2.6.30, but one thing that whetted my appetite was the support for distributing incoming network load across CPUs. But what’s the fuss all about? Here I demonstrate why spreading incoming network I/O over multiple CPUs (especially since multicore is the norm these days) should help speed up boards with low-end network hardware.

First, a little background. Many of the consumer grade motherboards on the market use low-end NICs which, under high network load, can incur a substantial CPU cost compared to enterprise grade NICs. This is because of shortcuts taken to get the device onto the market.

With this in mind, let’s take a closer look at what impact a bad NIC can have on Linux compared to one that has been properly optimized.

The Test Setup

Our test machine is a virtual machine running under QEMU + KVM. Configured on the VM are two network devices: an emulated RTL8139 chipset device (eth1) and the newer, and hopefully more efficient, virt-io paravirtualized network device (eth0).

ip address show
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
link/ether 52:54:00:73:67:73 brd ff:ff:ff:ff:ff:ff
inet brd scope global eth0
inet6 fe80::5054:ff:fe73:6773/64 scope link
valid_lft forever preferred_lft forever

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:54:00:73:67:74 brd ff:ff:ff:ff:ff:ff
inet brd scope global eth1
inet6 fe80::5054:ff:fe73:6774/64 scope link
valid_lft forever preferred_lft forever

And just to be thorough, the kernel modules we’re running:

lsmod | egrep '^(8139|virtio_net)'
8139too                27638  0
8139cp                 19191  0
virtio_net             14013  0
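
Since both 8139too and 8139cp are loaded, it can be worth confirming which one is actually bound to eth1. The following is only a minimal sketch, assuming the usual sysfs layout where /sys/class/net/<iface>/device/driver is a symlink to the bound driver. The nic_driver function and the SYSFS_ROOT override are hypothetical names introduced here for illustration:

```shell
# Print the kernel driver bound to a network interface.
# Assumes the sysfs layout /sys/class/net/<iface>/device/driver (a symlink).
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a fake tree; it defaults to the real /sys.
nic_driver() {
    link="${SYSFS_ROOT:-/sys}/class/net/$1/device/driver"
    # Virtual interfaces such as lo have no backing device, hence no link.
    [ -L "$link" ] || return 1
    basename "$(readlink "$link")"
}

# Example: nic_driver eth1
```

On the setup described here one would expect eth1 to report 8139cp and eth0 to report virtio_net.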

The Benchmarking

To start we’ll take down the virt-io device and see what kind of performance we are able to obtain from the 8139 device when we give it some work to do:

ip link set dev eth0 down

To benchmark this properly requires the use of a system profiler, and we have two options: Oprofile and Perf.

Perf is typically the one you should choose on newer kernels, and since our test server is running Fedora 12 we’ll be using it as our profiler.

Profiling works through special hardware performance counter registers on the CPUs, which are used to collect our statistics with very little overhead and so skew our benchmark less.

The test file is a file of random data hosted on the hypervisor itself, which we’ll download with wget. The idea is to max out our throughput by fetching the file from a neighbouring machine, so that as little network interference as possible can affect our results. Perf will record the data, which we can then analyze:

perf record -af wget
--2010-08-09 14:08:22--
Connecting to connected.
HTTP request sent, awaiting response... 200 OK
Length: 419419444 (400M) [application/octet-stream]
Saving to: "bigfile.img.1"

100%[===================================================>] 419,419,444 21.0M/s   in 14s

2010-08-09 14:08:37 (27.9 MB/s) - "bigfile.img.1"

[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 2.413 MB (~105447 samples) ]

What we have done here is get the profiler to monitor the entire system while the ‘wget’ command runs. This has given us reference samples. The percentages in the report are relative to the total load the system produced, so to get the overall load on the system we also ran mpstat at the same time. These results are listed below:

02:17:35 PM  CPU      %usr   %nice    %sys...
02:17:36 PM  all      0.00    0.00    0.00...
02:17:37 PM  all      2.00    0.00    8.00...
02:17:38 PM  all      0.00    0.00   17.44...
02:17:39 PM  all      1.01    0.00   18.18...
02:17:40 PM  all      0.00    0.00   12.12...
02:17:41 PM  all      1.01    0.00    9.09...
02:17:42 PM  all      0.00    0.00    4.08...
02:17:43 PM  all      1.00    0.00   11.00...
02:17:44 PM  all      0.00    0.00   16.83...
02:17:45 PM  all      0.00    0.00    2.02...
02:17:46 PM  all      0.94    0.00    5.66...
02:17:47 PM  all      0.00    0.00    5.56...
02:17:48 PM  all      0.00    0.00    2.11...
02:17:49 PM  all      0.98    0.00   16.67...
02:17:50 PM  all      0.00    0.00    9.20...
02:17:51 PM  all      0.99    0.00    7.92...
02:17:52 PM  all      0.00    0.00    2.08...

You can see here that system CPU load ramped up to about 13% whilst the download took place.
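
To put a number on the %sys column rather than eyeballing it, a small awk helper can average it. This is only a sketch: avg_sys and mpstat.log are hypothetical names, and it assumes the 12-hour layout shown above, where %sys is the sixth field. Note that %sys alone leaves out the interrupt columns (%irq, %soft) that the truncated listing does not show.

```shell
# Average the %sys column of mpstat output read from stdin.
# Assumes the 12-hour layout shown above: time, AM/PM, CPU, %usr, %nice, %sys,
# so %sys is field 6; with a 24-hour clock it would be field 5.
# Header lines and the trailing "Average:" line are skipped because their
# third field is not "all".
avg_sys() {
    awk '$3 == "all" { sum += $6; n++ }
         END { if (n) printf "%.2f\n", sum / n }'
}

# Example (mpstat.log is a hypothetical capture file):
#   mpstat 1 > mpstat.log &   # sample once per second in the background
#   ... run the download ...
#   avg_sys < mpstat.log
```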

The perf results, filtered with grep for our 8139 module, are listed below. They give us more insight into what is going on:

perf report | grep 8139
7.15%          swapper  [kernel] [k] cp_start_xmit        [8139cp]
4.72%          swapper  [kernel] [k] cp_interrupt [8139cp]
3.88%             wget  [kernel] [k] cp_rx_poll   [8139cp]
3.23%             wget  [kernel] [k] cp_start_xmit        [8139cp]
1.73%             wget  [kernel] [k] cp_interrupt [8139cp]
0.92%          swapper  [kernel] [k] cp_rx_poll   [8139cp]
0.17%      flush-253:0  [kernel] [k] cp_start_xmit        [8139cp]
0.12%      flush-253:0  [kernel] [k] cp_interrupt [8139cp]
0.03%             sshd  [kernel] [k] cp_start_xmit        [8139cp]
0.03%          swapper  [kernel] [k] dma_unmap_single_attrs.clone.2       [8139cp]
0.02%      flush-253:0  [kernel] [k] cp_rx_poll   [8139cp]
0.02%          swapper  [kernel] [k] dma_map_single_attrs.clone.1 [8139cp]
0.02%             sshd  [kernel] [k] cp_interrupt [8139cp]
0.01%             wget  [kernel] [k] dma_unmap_single_attrs.clone.2       [8139cp]
0.01%            ata/0  [kernel] [k] cp_start_xmit        [8139cp]
0.01%            ata/0  [kernel] [k] cp_interrupt [8139cp]
0.01%             sshd  [kernel] [k] cp_rx_poll   [8139cp]
0.01%             wget  [kernel] [k] dma_map_single_attrs.clone.1 [8139cp]
0.01%          kswapd0  [kernel] [k] cp_interrupt [8139cp]
0.01%  hald-addon-stor  [kernel] [k] cp_interrupt [8139cp]
0.01%          swapper  [kernel] [k] netif_wake_queue     [8139cp]
0.01%  hald-addon-stor  [kernel] [k] cp_start_xmit        [8139cp]
0.01%        scsi_eh_0  [kernel] [k] cp_start_xmit        [8139cp]
0.01%                X  [kernel] [k] cp_start_xmit        [8139cp]

Of the total load on the system, the 8139 driver accounted for about 20%. If we take the system usage from before, rounding it up to 15%, and take 20% of that, this indicates that about 3% of the CPU was used handling the network traffic.
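
Rather than adding the rows up by hand, the per-module total can be summed with awk. This is a minimal sketch, assuming the perf report text layout shown above (percentage in the first field, module name in square brackets on the same line). module_share is a hypothetical helper name:

```shell
# Sum the overhead percentages of every perf report line attributed to one
# kernel module. Reads `perf report` text output on stdin.
# Assumes the layout shown above: the first field is "N.NN%" and the module
# name appears in square brackets on the same line.
module_share() {
    awk -v tag="[$1]" '
        index($0, tag) { sub(/%/, "", $1); total += $1 }
        END { printf "%.2f\n", total }'
}

# Example: perf report | module_share 8139cp
```

Applied to the rows listed above, the total comes to about 22%, in line with the rough 20% figure used here.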

Let’s take a look at virt-io. We’ll enable it and run the same test.

perf record -af wget
--2010-08-09 14:25:35--
Connecting to connected.
HTTP request sent, awaiting response... 200 OK
Length: 419419444 (400M) [application/octet-stream]
Saving to: "bigfile.img.7"

100%[================================================>] 419,419,444 59.1M/s   in 6.0s

2010-08-09 14:25:41 (66.6 MB/s) - "bigfile.img.7"

Interesting: this run took less than half the time (6 seconds against 14).

02:25:34 PM  CPU    %usr   %nice    %sys[...]
02:25:35 PM  all    4.00    0.00   23.00
02:25:36 PM  all    2.67    0.00   52.00
02:25:37 PM  all    1.23    0.00   33.33
02:25:38 PM  all    1.00    0.00   17.00
02:25:39 PM  all    0.00    0.00    2.00
02:25:40 PM  all    3.03    0.00   36.36
02:25:41 PM  all    0.00    0.00    6.93
02:25:42 PM  all    0.00    0.00    1.01

So, system load was much higher this run using virt-io. Let’s check our perf results:

0.18%          swapper  [kernel] [k] virtnet_poll      [virtio_net]
0.08%             wget  [kernel] [k] virtnet_poll      [virtio_net]
0.04%          swapper  [kernel] [k] try_fill_recv     [virtio_net]
0.03%      flush-253:0  [kernel] [k] virtnet_poll      [virtio_net]
0.01%          swapper  [kernel] [k] start_xmit        [virtio_net]
0.01%          kswapd0  [kernel] [k] virtnet_poll      [virtio_net]
0.01%             wget  [kernel] [k] xmit_skb  [virtio_net]
0.01%             wget  [kernel] [k] start_xmit        [virtio_net]
0.01%             wget  [kernel] [k] try_fill_recv     [virtio_net]
0.00%          swapper  [kernel] [k] xmit_skb  [virtio_net]
0.00%          swapper  [kernel] [k] free_old_xmit_skbs        [virtio_net]
0.00%          kswapd0  [kernel] [k] free_old_xmit_skbs        [virtio_net]
0.00%      flush-253:0  [kernel] [k] free_old_xmit_skbs        [virtio_net]
0.00%      flush-253:0  [kernel] [k] start_xmit        [virtio_net]
0.00%      flush-253:0  [kernel] [k] try_fill_recv     [virtio_net]

Well, this is much better. virt-io accounts for well under 1% of the samples. Even taking a generous 1% of the average 25% system usage for the task, that’s 0.25% of the total CPU, about 12 times more efficient!
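
The arithmetic behind that comparison can be checked with a few lines of shell. This is only a sketch using the rounded figures quoted in the text. driver_cpu is a hypothetical helper name:

```shell
# Fraction of total CPU spent in a driver: its share of perf samples (in %)
# multiplied by the system-wide CPU usage (in %), expressed again in %.
driver_cpu() {
    awk -v share="$1" -v usage="$2" 'BEGIN { printf "%.2f\n", share * usage / 100 }'
}

# Rounded figures from the text: 20% of 15% for 8139cp, 1% of 25% for virtio_net.
old=$(driver_cpu 20 15)   # 3.00
new=$(driver_cpu 1 25)    # 0.25

# Ratio between the two costs.
awk -v a="$old" -v b="$new" 'BEGIN { printf "%.0fx\n", a / b }'   # 12x
```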

So, what does this show us?

Well, this test was always going to be a no-contest race, because on a VM like this the 8139 is fully emulated whereas virt-io is paravirtualized. virt-io was bound to win.

But what this does demonstrate is that differences in driver implementation can significantly affect CPU usage. On consumer systems especially, cheap NICs can cost perhaps 3-4% of the CPU over the long term. This might not seem like a lot now, but when we delve into the realms of 10G ethernet, it will start to show even on more modern CPUs. Having multiple CPUs handle incoming traffic spreads this load out, leaving your system free to handle other tasks, or at least blocking less, which could lead to increased throughput.
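
Assuming the 2.6.35 feature in question is Receive Packet Steering (RPS), it is switched on per receive queue by writing a hexadecimal CPU bitmask to the queue's rps_cpus file in sysfs. The sketch below is illustrative only: rps_mask and enable_rps are hypothetical helper names, eth1 and rx-0 are just examples, and writing the mask needs root.

```shell
# Build the hexadecimal CPU bitmask covering CPUs 0..N-1,
# e.g. 4 CPUs -> f. Intended for small CPU counts (fewer than 32).
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

# Write the mask for the first receive queue of a device.
# Needs root; a mask of 0 leaves steering disabled.
enable_rps() {
    dev=$1
    ncpus=$2
    rps_mask "$ncpus" > "/sys/class/net/$dev/queues/rx-0/rps_cpus"
}

# Example: enable_rps eth1 4
```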


This change, ultimately, will make Linux systems perform better on very high speed networks. With the enterprise trend beginning to move to high speed SANs using iSCSI, and perhaps further in the future Fibre Channel over Ethernet, it becomes important for system administrators to know where their overheads are. 10G NICs in these environments will really benefit from multi-core CPUs, which most people should be using by the time 10G becomes the norm.

As a system administrator myself, I have a keen interest in resource accounting. Measuring efficiency is important in our business, and I will always welcome improvements that take little effort on my part.