This document presents my understanding of the implementation and
performance issues of software routing. The questions I want to answer are:
- How many threads are involved in the networking stack?
- How long does a received packet spend in the system without being processed?
- How long does a packet to be sent wait before going out on the interface?
Due to easy access to documentation and source, the focus is on the Linux
kernel. We start by tracing the lifeline of a received packet, follow it with the
lifeline of a packet to be transmitted, and finally combine the two to answer the
questions above. The focus is on the networking layer and below. TCP/UDP is not looked at
(in particular, the TCP implementation will have some other thread(s) for
handling retransmissions etc. We ignore those for the moment).
A very high-level view is that packet reception causes the network
interface device to raise an interrupt, in response to which the
interrupt handler copies the packet from the device to main memory
and then processes it (looks at headers etc.). Packet handling itself
should not be (and is not) done in the interrupt context (i.e., when some
interrupts may be masked), but done upon return from the interrupt
context (i.e., when all interrupts are enabled). Linux uses "softirqs"
for this, details of which are described in Section 2.1.3 (Softirqs).
2.1.1 Direct Memory Access (DMA)
There are some bottlenecks in this high-level view. For
example, copying the packet from the device's buffers to main
memory. Instead, the device can use DMA to copy the packet
into memory and then interrupt the processor only when the
packet is "accessible". This is precisely what happens: each
device driver maintains a DMA ring (circular buffer) and the
device interrupts the processor only when the packet is ready.
On interface initialization (i.e., when it is brought "up"
which means when the driver's "open()" function is called),
the driver sets up and initializes a DMA ring buffer. When a
packet is received, it is copied into main memory (one of the
slots in the DMA ring buffer) and an interrupt is raised only
after the packet is accessible to the kernel.
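The ring itself can be pictured as a fixed array of slots with a producer (the device) and a consumer (the kernel). The sketch below is a toy userspace model of that idea, not code from any real driver; all names and sizes are invented for illustration.

```c
/* Toy userspace model of a receive DMA ring: a fixed array of slots
 * that the "device" fills and the "kernel" drains.  All names and
 * sizes here are invented for illustration. */
#define RING_SIZE 8

struct rx_ring {
    int head;                   /* next slot the device will fill */
    int tail;                   /* next slot the kernel will drain */
    int pkt_len[RING_SIZE];
};

/* Device side: DMA a packet into the next free slot.
 * Returns 0 on success, -1 if the ring is full (the packet is
 * silently dropped, as described later for overload). */
static int ring_put(struct rx_ring *r, int len)
{
    int next = (r->head + 1) % RING_SIZE;
    if (next == r->tail)
        return -1;
    r->pkt_len[r->head] = len;
    r->head = next;
    return 0;
}

/* Kernel side: drain one packet; returns its length, or -1 if the
 * ring is empty. */
static int ring_get(struct rx_ring *r)
{
    int len;
    if (r->tail == r->head)
        return -1;
    len = r->pkt_len[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return len;
}
```

One slot is deliberately kept unused to distinguish "full" from "empty", so a ring of 8 slots holds at most 7 packets.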
2.1.2 Interrupt Livelock
Another problem is what is commonly referred to as a
"livelock". If packets are being received fast enough, the
kernel never gets to process them because interrupts are being
generated too fast, i.e., the CPU spends 100% of its time in
interrupt handling. A "fix" to this problem (and some others)
was described in "Beyond Softnet" (Salim, Olsson and Kuznetsov, November 2001) and is referred to as
NAPI (New API). The changes suggested in this paper were
included in Linux 2.4.20 and to the best of my knowledge, this
is what happens in kernels between 2.4.20 and 2.6.8. The rest
of this section thus talks about Linux networking code using
NAPI. Note that not all drivers use the NAPI interface, for
example in 2.4.23 it seems only the tg3 driver uses NAPI.
Kernel 2.6.6 has more NAPI drivers such as b44, 8139cp and
typhoon (source in drivers/net). Older (non-NAPI) drivers are
still supported and there is a backward compatibility "hack",
but we will not be concerned with the old methodology (called "softnet").
In order to keep the kernel response time small, interrupts
must be disabled for as little time as possible. To achieve
this, the tasks that need to be done upon an interrupt are
divided into two - critical or "top-half" and "deferrable".
Deferrable tasks execute with all interrupts enabled.
Softirqs are how Linux implements deferrable tasks. (This
section is essentially a summary of Chapter 4 in Bovet and
Cesati, "Understanding the Linux Kernel".) There are a limited
number of softirqs (the kernel code allows for 32 but only 4
seem to be used; these different softirqs are called "types").
The properties of softirqs are:
- No softirq can be interrupted to run another softirq
on the same CPU
- Softirqs (of any type) can run concurrently on
multiple CPUs
When a network interface interrupts on packet reception, a
softirq is scheduled. The softirq code is then executed when
the interrupt handler finishes.
The softirq type for received network packets is
NET_RX_SOFTIRQ and the handling code is in the function
net_rx_action().
Typically, the sequence of events is something like
- device generates interrupt
- interrupt handler schedules softirq
- interrupt handler exits (and now all interrupts are enabled)
- a check is made for all scheduled softirqs (remember that
there are other softirqs besides NET_RX_SOFTIRQ) and their
handlers are called. This is done by do_softirq()
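The sequence above can be modelled in a few lines. This is a toy simulation of the raise-then-run hand-off; the bit numbers and names mimic the kernel's, but the code is purely illustrative:

```c
/* Toy simulation of the raise-then-run sequence: an interrupt handler
 * marks a softirq pending, and do_softirq_once() later runs each
 * pending handler exactly once. */
#define NET_RX_SOFTIRQ 0
#define NET_TX_SOFTIRQ 1

static unsigned int softirq_pending;
static int rx_runs, tx_runs;

static void raise_softirq(int nr)
{
    softirq_pending |= 1u << nr;
}

/* What a network device's interrupt handler would do. */
static void rx_interrupt(void)
{
    raise_softirq(NET_RX_SOFTIRQ);
}

/* Called on return from the interrupt: run the pending handlers. */
static void do_softirq_once(void)
{
    unsigned int pending = softirq_pending;

    softirq_pending = 0;
    if (pending & (1u << NET_RX_SOFTIRQ))
        rx_runs++;              /* stands in for net_rx_action() */
    if (pending & (1u << NET_TX_SOFTIRQ))
        tx_runs++;
}
```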
This background should be enough for now, some finer
details will be discussed later.
Here we describe some details of NAPI ("Beyond Softnet", Salim, Olsson and Kuznetsov, November 2001). One of the primary goals of NAPI was
to prevent livelocks and it does so by "adaptive interrupt
coalescing". The idea is to use a mixture of interrupts and
polling - i.e., at times device interrupts are disabled and
packets are collected by polling instead. This reduces the
chances of a livelock.
As mentioned earlier, the model of DMA was that the network
interface device would interrupt the processor when the packet
has been copied into main memory. A per-packet or
per-fixed-number-of-packets interrupt helps in creating
livelocks. NAPI proposes the following change: the device
interrupts the processor upon reception of the "first" packet.
The interrupt handler adds the device to a poll list, letting
the kernel know that there is some work to be done on the
device. The driver then disables further interrupts caused by
a new incoming packet or by running out of buffers. Clearly,
this practice reduces the number of interrupts generated by
the device. It also leads to silent packet drops upon signs of
overload (when the buffers are full), which is another design
goal of NAPI.
The situation at this point is this: the kernel knows that
there is a device with "some" packets on it. The device itself
will not trouble the processor with interrupts as more packets arrive.
When the softirq is activated, all devices that registered
themselves are polled for packets. Note that the list of
devices to be polled is a per-CPU list. So each CPU has a
mutually exclusive set of devices to poll.
The polling functionality itself is implemented by the
driver (poll function pointer in struct net_device). The
driver's poll() function is called by net_rx_action() asking
it to process received packets and setting a limit on the
total number of packets that it can process (to ensure
fairness amongst interfaces). If all outstanding packets have
been processed, then the driver re-enables receive interrupts.
Earlier we mentioned that upon receipt of the "first" packet,
the driver disables receive-related interrupts from the
device. What this really means is that if a receive generates
an interrupt then it will generate no more until explicitly
re-enabled. Interrupts will be re-enabled only after the kernel
gets around to processing the packets already received.
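Putting the pieces together, the interrupt-then-poll cycle can be sketched as a small simulation. The structure and function names below are invented; only the control flow (disable interrupts, poll with a budget, re-enable when drained) follows the NAPI description above:

```c
/* Toy NAPI cycle: one device, a pending-packet counter standing in
 * for the DMA ring, and a poll budget.  Illustrative only. */
struct toy_dev {
    int pending;        /* packets waiting in the "ring" */
    int irq_enabled;    /* receive interrupts currently enabled? */
    int on_poll_list;   /* registered for polling? */
};

/* Interrupt on the "first" packet: disable further rx interrupts and
 * ask the kernel to poll us. */
static void toy_rx_interrupt(struct toy_dev *d)
{
    d->irq_enabled = 0;
    d->on_poll_list = 1;
}

/* The driver's poll(): process at most 'budget' packets.  When the
 * ring is drained, leave the poll list and re-enable interrupts.
 * Returns the number of packets processed. */
static int toy_poll(struct toy_dev *d, int budget)
{
    int done = d->pending < budget ? d->pending : budget;

    d->pending -= done;
    if (d->pending == 0) {
        d->on_poll_list = 0;
        d->irq_enabled = 1;
    }
    return done;
}
```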
The documentation included with the Linux kernel source
(Documentation/networking/NAPI_HOWTO.txt) describes the NAPI interface
in greater detail.
The net effect of NAPI is that under low loads it converges
to an interrupt driven scheme (where "low load" gives the
kernel enough time to process a packet before the next one
arrives) and under high load the ratio of interrupts to
packets is reduced. Results in "Beyond Softnet" show how
the scheme reduces latency and increases throughput (ratio of
outgoing packets to incoming packets in a packet forwarder).
Although we have not discussed this here, NAPI also prevents
packet re-ordering, which was possible with the pre-NAPI
scheme.
Thus, NAPI limits the interrupt rate (this can be seen as
adaptive interrupt coalescing) and prevents livelocks.
However, since a device is handled by at most one CPU at a
time, it does limit the parallelism possible with a single
interface.
2.2.1 Handling a flood
Suppose there is a flood of incoming packets. NAPI's
adaptive interrupt coalescing prevents a flood of interrupts.
If packets are coming faster than the softirq can process
them, i.e., the device's DMA ring is full, then the device
silently drops the new packets.
So far so good. The kernel seems to be processing packets
as fast as it can. However, remember that softirqs (and thus
net_rx_action(), the driver's poll() function etc.) are
invoked upon return from an interrupt handler. With this
infrastructure, the kernel will just try to keep up with
packet processing and user level code will never get to the
CPU. Thus, user level processes, if any, will starve.
This is where a special kernel thread (ksoftirqd_CPUn)
comes in, one thread for each CPU. Let's look at how softirqs
come into being again. When interrupt handling finishes (the
do_IRQ() function) then do_softirq() is called. This function
simply looks at pending softirq's and calls their handlers
(NET_RX_SOFTIRQ is a softirq and net_rx_action() is its
corresponding handler). The do_softirq() function ensures
that it calls a particular softirq handler at most once. For
example, after net_rx_action() returns if do_softirq() notices
that there is another NET_RX_SOFTIRQ pending then it does NOT
call net_rx_action() again. Instead it wakes up a low priority
thread - ksoftirqd. ksoftirqd runs in an infinite loop,
checking if there is any pending softirq and, if so, executing
the corresponding handler.
The net_rx_action() handler processes at most
netdev_max_backlog packets in one invocation. In kernels
2.4.23 and 2.6.8 for example, netdev_max_backlog is set to 300
by default. This value can be changed using
/proc/sys/net/core/netdev_max_backlog. If there are more
packets to be handled, then net_rx_action() schedules another
NET_RX_SOFTIRQ. The softirq schedules itself.
Thus, when net_rx_action() returns after handling 300 or so
packets, do_softirq() notices that NET_RX_SOFTIRQ has been
scheduled again. Instead of trying to process those packets
it wakes up ksoftirqd. Now, the remaining packets will
be processed when ksoftirqd gets CPU time.
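A toy model of this hand-off, with a small made-up quota of 3 standing in for netdev_max_backlog (all names are illustrative):

```c
/* Toy model of the hand-off to ksoftirqd: the handler processes up to
 * a quota of packets and re-raises the softirq if work remains;
 * toy_do_softirq() runs the handler at most once and defers the rest. */
static int backlog;             /* packets still to be processed */
static int rx_raised;           /* is NET_RX_SOFTIRQ pending? */
static int ksoftirqd_woken;

static void toy_net_rx_action(void)
{
    int quota = 3;              /* stand-in for netdev_max_backlog */
    int n = backlog < quota ? backlog : quota;

    backlog -= n;
    rx_raised = 0;
    if (backlog > 0)
        rx_raised = 1;          /* the softirq schedules itself */
}

static void toy_do_softirq(void)
{
    if (rx_raised)
        toy_net_rx_action();    /* called at most once here */
    if (rx_raised)
        ksoftirqd_woken = 1;    /* leftover work goes to ksoftirqd */
}
```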
ksoftirqd is a low priority thread, so if there are user
processes waiting for CPU time, they will get it before
ksoftirqd. This mechanism ensures that under heavy network
traffic, users of a system don't get frustrated. On the other
hand, if the system acts as a router then there should not be
many user processes wanting CPU time and hence ksoftirqd runs
all the time.
In short, under heavy traffic, all seems under control. The
kernel tries to keep up with the incoming packet rate without
starving user processes or being interrupted incessantly by the hardware.
3. Packet Transmission
Packet transmission seems to be much simpler than reception. In this section
we take a bottom-up approach - starting from the device we move up to how the
network layer manages queues of packets to be sent.
3.1 The device and its driver
DMA during transmission works as a "streaming" mapping instead of a
"consistent" mapping (see Chapter 13 in Rubini and Corbet, "Linux Device Drivers"). In the
case of the receive buffers, the driver can provide the device with a
fixed set of addresses where it should place incoming packets, thus
the DMA mapping is fixed for the lifetime of the driver (it is
"consistent"). This works because the number of incoming packets is
limited by the device. A system can however generate packets for
transmission faster than the device can handle. Hence, when the device
driver is given a packet to send, it first sets up a DMA mapping (mapping
on a per-operation basis is called "streaming") and then instructs the
hardware to start sending.
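The difference in mapping lifetimes can be made concrete with a toy counter model. The map_buffer()/unmap_buffer() functions are stand-ins, not the kernel's DMA API; the counter just makes the lifetimes visible:

```c
/* Toy contrast of the two mapping styles: a "consistent" mapping is
 * set up once when the interface is opened, a "streaming" mapping is
 * created and torn down per packet. */
static int mappings_live;

static void map_buffer(void)   { mappings_live++; }
static void unmap_buffer(void) { mappings_live--; }

/* Receive side: one long-lived mapping per ring slot at open() time. */
static void rx_open(int ring_slots)
{
    int i;
    for (i = 0; i < ring_slots; i++)
        map_buffer();           /* stays mapped while the driver runs */
}

/* Transmit side: map just before handing the packet to the device,
 * unmap when the transmit-complete interrupt arrives. */
static void tx_one_packet(void)
{
    map_buffer();
    /* ... hardware sends the packet ... */
    unmap_buffer();
}
```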
The transmission routine simply sets up the DMA mapping and some
registers on the hardware and exits. It does not wait for the device
to physically complete transmission. Concurrent accesses to the
driver's transmission function (hard_start_xmit()) are prevented by a spinlock.
The device can only handle a limited number of outstanding packets.
Thus, calls to the transmission function are throttled - when the
transmission ring is full, no more calls to the driver's transmission
function are made. The "queue" of packets is "stopped"
(netif_stop_queue()). When the device is ready for more, then the queue
is woken up (netif_wake_queue()) and packets in the queue can be sent.
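A minimal sketch of this throttling, assuming a made-up ring size of four outstanding packets (all names are illustrative):

```c
/* Toy model of transmit throttling: the device accepts a limited
 * number of outstanding packets; when its ring fills, the queue is
 * stopped, and a completion "interrupt" wakes it again. */
#define TX_SLOTS 4

static int tx_outstanding;
static int tx_queue_stopped;

/* Hand one packet to the device; returns 0 if accepted, -1 if the
 * queue is stopped (the caller must wait for a wake-up). */
static int toy_xmit(void)
{
    if (tx_queue_stopped)
        return -1;
    tx_outstanding++;
    if (tx_outstanding == TX_SLOTS)
        tx_queue_stopped = 1;   /* netif_stop_queue() */
    return 0;
}

/* Transmit-complete interrupt: one slot freed, queue woken. */
static void toy_tx_done(void)
{
    tx_outstanding--;
    tx_queue_stopped = 0;       /* netif_wake_queue() */
}
```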
The NET_TX_SOFTIRQ softirq is used to achieve this. When the packet
transmission code sees that the device's queue is stopped, it
schedules a NET_TX_SOFTIRQ. As in the case of the receive path, the
ksoftirqd thread plays a role here.
The story so far is thus - the device driver provides a function
which instructs the hardware to begin transmission. This function is
protected from concurrent accesses by a spinlock. Under heavy load
(when the hardware is unable to keep up), packet transmission may be
deferred to the low priority ksoftirqd thread.
The device driver also has a timeout mechanism to react to
occasional hardware faults (such as losing an interrupt). For this the
driver provides a timeout callback function and the kernel implements
a watchdog timer (dev_watchdog()). However, we will not concern
ourselves with the timeouts at this time.
3.2 Queues and Disciplines
So far we have seen what happens when the device driver is
instructed to send a packet. But how do packets get to this stage?
Let's consider a packet ready to send, present in the network (IP)
layer. It could have reached here either because of higher layers
(TCP/UDP) or it is a packet that was received and needs to be
forwarded. Some higher level protocols such as UDP/ICMP are fairly
simple and the packet reaches the IP layer (ip_queue_xmit()) almost
"immediately", i.e., only a few function calls lie in between. TCP on
the other hand is slightly more complicated with its congestion
window, retransmission timers etc. So we ignore TCP for now and assume
virtually no latency/delay between the time the packet was created or
a decision was made to forward it and the time the IP layer sees it.
(NOTE: The exact sequence of function calls may be slightly tedious to
trace, but looks something like this: ip_queue_xmit() --> ip_output()
--> ip_finish_output() --> dev_queue_xmit()).
The link layer is where some complexity sets in because the device
driver may not be invoked to send out the packet immediately, instead
the packet may be queued for later transmission.
Each device maintains a queue of packets to be transmitted. How a
packet is selected from this queue is known as the "queuing
discipline" (and causes many variables and functions in the Linux
kernel to be named "qdisc..."). The default qdisc is FIFO, but Linux
allows other sophisticated schemes such as RED (Random Early
Detection), CBQ and others.
When the link layer transmission function (dev_queue_xmit()) is
called, it enqueues the packet according to the queuing discipline. It
then dequeues a packet (the queueing discipline will select the "best"
packet to send now) and invokes the device driver's transmit function
(dev_queue_xmit() --> qdisc_run() --> qdisc_restart() -->
dev->hard_start_xmit()). In case of any problems, such as the device
being locked by a thread on another CPU (the xmit_lock spinlock) or
the device not being in a position to accept more packets, a softirq
(NET_TX_SOFTIRQ) is scheduled and packet transmission will happen at a
later time. Packet transmission will be tried again when the softirq is run
(ksoftirqd thread) or when there is an attempt to send another packet
and the device is free at that time.
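The enqueue/dequeue/requeue dance can be sketched as a toy FIFO qdisc. The dev_busy flag stands in for a held xmit_lock or a stopped queue; everything here is a simplified illustration, not real qdisc code:

```c
/* Toy FIFO qdisc plus the retry step of qdisc_restart(): dequeue a
 * packet, try the driver, and requeue at the head if the device is
 * busy. */
#define QLEN 16

static int q[QLEN];
static int q_head, q_count;

static int fifo_enqueue(int pkt)
{
    if (q_count == QLEN)
        return -1;              /* qdisc full: drop */
    q[(q_head + q_count) % QLEN] = pkt;
    q_count++;
    return 0;
}

static int fifo_dequeue(void)
{
    int pkt;

    if (q_count == 0)
        return -1;
    pkt = q[q_head];
    q_head = (q_head + 1) % QLEN;
    q_count--;
    return pkt;
}

/* One transmission attempt; returns the packet "sent", or -1 if the
 * packet was requeued (a NET_TX_SOFTIRQ would be scheduled here). */
static int toy_qdisc_restart(int dev_busy)
{
    int pkt = fifo_dequeue();

    if (pkt < 0)
        return -1;
    if (dev_busy) {
        q_head = (q_head + QLEN - 1) % QLEN;    /* requeue at head */
        q[q_head] = pkt;
        q_count++;
        return -1;
    }
    return pkt;
}
```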
In short, packets to be sent get queued and are sent out "over the
wire" by the ksoftirqd thread if immediate sending is not possible.
Based on the discussion so far, we now try to answer the questions
posed at the beginning in Section 1 (Introduction).
- How many threads are involved in the networking stack (IP and
below)? - A per-CPU ksoftirqd thread and a per-interface watchdog timer.
If the packet rate is very low, then the threads do not get involved:
receive handling happens just right after the interrupt handler returns
and transmission happens immediately. If the rates are very high, then
most receive processing and sends are handled by the ksoftirqd thread.
The watchdog timer just restarts transmissions in case there were
hardware faults (such as a lost interrupt).
- How long does a received packet spend in the system without being
processed? - Assuming there are no other processes to run, processing
of received packets is delayed only by interrupts and packet
transmissions. With NAPI interrupt coalescing, receive interrupts are
limited and occur only when the kernel is ready to receive and process
more. Packet transmissions on the other hand may cause many
interrupts. Will have to investigate further to find out how much
interference these interrupts cause. Remember that packet
transmissions are throttled depending on the size of the send buffer
of the device.
- How long does a packet to be sent wait before going out on the
interface? - Packet transmissions are throttled based on the number of
outstanding transmissions the device can handle. Thus, packet
transmissions are limited by the device. When the device is ready to
send more packets it will generate an interrupt and the next packet to
be sent will at worst have to wait for some received packets to be
processed (netdev_max_backlog packets at most). This is because packet
transmissions and receive processing are both done by a softirq and
two softirqs cannot run on the same CPU at the same time.
5. More Questions
Some questions on the "TODO" list:
- NAPI can save on tons of receive interrupts. But for transmission,
there is still one interrupt per successful send? If so, in a
forwarder, the number and rate of incoming and outgoing packets should
be nearly equal and if livelocks were caused due to per-packet receive
interrupts then don't per-packet transmission related interrupts have
an adverse effect as well?
- Some more details on the transmission watchdog timer. How
frequently do retransmissions need to take place?
6. References
- Rio, M., Goutelle, M., Kelly, T., Hughes-Jones, R., Martin-Flatin, J. and Y. Li, "A Map of the Networking Code in Linux Kernel 2.4.20", March 2004.
- Salim, J., Olsson, R. and A. Kuznetsov, "Beyond Softnet", November 2001.
- Bovet, D. and M. Cesati, "Understanding the Linux Kernel", 2nd Edition, O'Reilly & Associates, ISBN 81-7366-589-3.
- Linux kernel source: Documentation/networking/NAPI_HOWTO.txt.
- Rubini, A. and J. Corbet, "Linux Device Drivers", 2nd Edition, O'Reilly & Associates, June 2001, ISBN 0-59600-008-1.