Memo                                                              A. Shankar
                                                              September 2004

Linux Networking


Table of Contents

1.  Introduction
2.  Receive
    2.1  Background
        2.1.1  Direct Memory Access (DMA)
        2.1.2  Interrupt Livelock
        2.1.3  Softirqs
    2.2  NAPI
        2.2.1  Handling a flood
3.  Transmission
    3.1  The device and its driver
    3.2  Queues and Disciplines
4.  Conclusions
5.  More Questions
6.  References
§  Author's Address





1. Introduction

This document presents my understanding of the implementation and performance issues of software routing. The questions I want to answer include:

  • How many threads are involved in the networking stack?
  • How long does a received packet spend in the system without being processed?
  • How long does a packet to be sent wait before going out on the interface?

Due to easy access to documentation and source, the focus is on the Linux implementation.

We start by tracing the lifeline of a received packet, follow it with the lifeline of a packet to be transmitted, and finally combine the two to answer the questions above.

The focus is on the networking layer and below. TCP/UDP are not examined (in particular, the TCP implementation has some other thread(s) for handling retransmissions etc.; we ignore those for the moment).




2. Receive

2.1 Background

A very high-level view is that packet reception causes the network interface device to raise an interrupt, in response to which the interrupt handler copies the packet from the device to main memory and then processes it (looks at headers etc.). Packet handling itself should not be (and is not) done in the interrupt context (i.e., when some interrupts may be masked), but upon return from the interrupt context (i.e., when all interrupts are enabled). Linux uses "softirqs" for this, details of which are described in Section 2.1.3 (Softirqs).

2.1.1 Direct Memory Access (DMA)

There are some bottlenecks in this high-level view, for example copying the packet from the device's buffers to main memory. Instead, the device can use DMA to copy the packet into memory and then interrupt the processor only when the packet is "accessible". This is precisely what happens: each device driver maintains a DMA ring (circular buffer) and the device interrupts the processor only when the packet is ready in memory.

On interface initialization (i.e., when it is brought "up" which means when the driver's "open()" function is called), the driver sets up and initializes a DMA ring buffer. When a packet is received, it is copied into main memory (one of the slots in the DMA ring buffer) and an interrupt is raised only after the packet is accessible to the kernel.
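
As an illustration, here is a minimal sketch of what such an open() routine might do. The device name ("mydev"), the mydev_priv structure and the ring dimensions are invented for this example; real drivers (tg3, 8139cp etc.) are considerably more involved, but the pci_alloc_consistent() call and the idea of handing the device a bus address it can DMA into are the essential parts.

    #include <linux/netdevice.h>
    #include <linux/pci.h>
    #include <linux/errno.h>

    #define RX_RING_SLOTS 256
    #define RX_BUF_LEN    1536

    struct mydev_priv {
            struct pci_dev *pdev;
            void           *rx_ring;      /* kernel-visible receive buffers   */
            dma_addr_t      rx_ring_dma;  /* bus address the device DMAs into */
    };

    static int mydev_open(struct net_device *dev)
    {
            struct mydev_priv *priv = dev->priv;

            /* One "consistent" mapping, set up once and reused for the
             * lifetime of the interface: the device always knows where to
             * place incoming packets. */
            priv->rx_ring = pci_alloc_consistent(priv->pdev,
                                                 RX_RING_SLOTS * RX_BUF_LEN,
                                                 &priv->rx_ring_dma);
            if (!priv->rx_ring)
                    return -ENOMEM;

            /* rx_ring_dma would now be programmed into a device register so
             * the hardware can fill the slots and then raise an interrupt. */
            netif_start_queue(dev);
            return 0;
    }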

2.1.2 Interrupt Livelock

Another problem is what is commonly referred to as a "livelock". If packets are being received fast enough, the kernel never gets to process them because interrupts are being generated too fast, i.e., the CPU spends 100% of its time in interrupt handling. A "fix" to this problem (and some others) was described in [2] and is referred to as NAPI (New API). The changes suggested in this paper were included in Linux 2.4.20 and, to the best of my knowledge, this is what happens in kernels between 2.4.20 and 2.6.8. The rest of this section thus talks about Linux networking code using NAPI. Note that not all drivers use the NAPI interface; for example, in 2.4.23 it seems only the tg3 driver uses NAPI. Kernel 2.6.6 has more NAPI drivers such as b44, 8139cp and typhoon (source in drivers/net). Older (non-NAPI) drivers are still supported and there is a backward compatibility "hack", but we will not be concerned with the old methodology (called softnet).

2.1.3 Softirqs

In order to keep the kernel response time small, interrupts must be disabled for as little time as possible. To achieve this, the tasks that need to be done upon an interrupt are divided into two - critical or "top-half" and "deferrable". Deferrable tasks execute with all interrupts enabled.

Softirqs are how Linux implements deferrable tasks. (This section is essentially a summary of Chapter 4 in [3].) There are a limited number of softirqs (the kernel code allows for 32 but only 4 seem to be used; these different softirqs are called "types"). The properties of a softirq are:

  • No softirq can be interrupted to run another softirq on the same CPU
  • Softirqs (of any type) can run concurrently on multiple CPUs

When a network interface interrupts on packet reception, a softirq is scheduled. The softirq code is then executed when the interrupt handler finishes.

The softirq type for received network packets is NET_RX_SOFTIRQ and the handling code is in the function net_rx_action().
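
For reference, the binding of softirq types to their handlers is done once at boot. The sketch below is modelled on net_dev_init() in net/core/dev.c for the 2.4.20-2.6.8 kernels (the handlers are static to that file; they are shown here out of context only to illustrate the registration).

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    static void net_softirq_init_sketch(void)
    {
            /* After this, raising NET_RX_SOFTIRQ guarantees that
             * net_rx_action() runs on the next softirq pass; likewise for
             * transmission and net_tx_action(). */
            open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
            open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
    }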

Typically, the sequence of events is something like this:

  1. device generates interrupt
  2. interrupt handler schedules softirq
  3. interrupt handler exits (and now all interrupts are enabled)
  4. a check is made for all scheduled softirqs (remember that there are other softirqs besides NET_RX_SOFTIRQ) and their handlers are called. This is done by do_softirq(); a simplified sketch of its dispatch loop follows.
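
The sketch below is a heavily simplified version of the dispatch loop inside do_softirq() (kernel/softirq.c, using 2.6 naming). The real code also saves and restores interrupt state, limits the number of passes and wakes up ksoftirqd (Section 2.2.1). softirq_vec is the internal per-type handler table filled in by open_softirq().

    #include <linux/interrupt.h>
    #include <linux/types.h>

    static void do_softirq_sketch(void)
    {
            __u32 pending = local_softirq_pending(); /* bitmask of raised softirqs */
            struct softirq_action *h = softirq_vec;

            while (pending) {
                    if (pending & 1)
                            h->action(h);  /* e.g. net_rx_action() for NET_RX_SOFTIRQ */
                    h++;
                    pending >>= 1;
            }
    }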

This background should be enough for now, some finer details will be discussed later.

2.2 NAPI

Here we describe some details of NAPI [2]. One of the primary goals of NAPI was to prevent livelocks and it does so by "adaptive interrupt coalescing". The idea is to use a mixture of interrupts and polling - i.e., at times device interrupts are disabled and packets are collected by polling instead. This reduces the chances of a livelock.

As mentioned earlier, the model of DMA was that the network interface device would interrupt the processor when the packet has been copied into main memory. A per-packet or per-fixed-number-of-packets interrupt helps in creating livelocks. NAPI proposes the following change: the device interrupts the processor upon reception of the "first" packet. The interrupt handler adds the device to a poll list, letting the kernel know that there is some work to be done on the device. The driver then disables further interrupts caused by a new incoming packet or by running out of buffers. Clearly, this practice reduces the number of interrupts generated by the device. It also leads to silent packet drops upon signs of overload (when the buffers are full), which is another design goal of NAPI.
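
A receive interrupt handler in a NAPI driver therefore looks roughly like the sketch below (2.6-style prototype). mydev_read_irq_status(), mydev_disable_rx_irq() and MYDEV_RX_INTR are hypothetical stand-ins for device-specific register accesses; netif_rx_schedule() is the real kernel helper that puts the device on the per-CPU poll list and raises NET_RX_SOFTIRQ.

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>
    #include <linux/types.h>

    static irqreturn_t mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            struct net_device *dev = dev_id;
            u32 status = mydev_read_irq_status(dev);  /* hypothetical register read */

            if (status & MYDEV_RX_INTR) {
                    /* 1. Ask the NIC not to interrupt for further packets. */
                    mydev_disable_rx_irq(dev);
                    /* 2. Let the kernel know this device has work pending;
                     *    net_rx_action() will later call our poll() function. */
                    netif_rx_schedule(dev);
            }
            return IRQ_HANDLED;
    }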

The situation at this point is this: the kernel knows that there is a device with "some" packets on it, and the device itself will not generate further interrupts as more packets arrive.

When the softirq is activated, all devices that registered themselves are polled for packets. Note that the list of devices to be polled is a per-CPU list. So each CPU has a mutually exclusive set of devices to poll.

The polling functionality itself is implemented by the driver (poll function pointer in struct net_device). The driver's poll() function is called by net_rx_action() asking it to process received packets and setting a limit on the total number of packets that it can process (to ensure fairness amongst interfaces). If all outstanding packets have been processed, then the driver re-enables receive interrupts. Earlier we mentioned that upon receipt of the "first" packet, the driver disables receive-related interrupts from the device. What this really means is that if a receive generates an interrupt then it will generate no more until explicitly re-enabled. Interrupts will be re-enabled only after the kernel gets around to processing the packets already received.
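
A poll() function built on the dev->poll/dev->quota interface of these kernels might look like the following sketch. mydev_rx_ring_has_packet(), mydev_next_rx_skb() and mydev_enable_rx_irq() are hypothetical helpers; netif_receive_skb() and netif_rx_complete() are the real kernel entry points.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    static int mydev_poll(struct net_device *dev, int *budget)
    {
            /* Process at most this many packets (fairness across devices). */
            int limit = (dev->quota < *budget) ? dev->quota : *budget;
            int done = 0;

            while (done < limit && mydev_rx_ring_has_packet(dev)) {
                    /* Take one packet off the DMA ring (already in memory). */
                    struct sk_buff *skb = mydev_next_rx_skb(dev);
                    netif_receive_skb(skb);      /* hand it up to the IP layer */
                    done++;
            }
            *budget    -= done;
            dev->quota -= done;

            if (mydev_rx_ring_has_packet(dev))
                    return 1;                    /* more work: stay on the poll list */

            /* Ring drained: remove ourselves from the poll list and let the
             * hardware interrupt us again for the next "first" packet. */
            netif_rx_complete(dev);
            mydev_enable_rx_irq(dev);
            return 0;
    }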

Documentation included with the Linux kernel source (Documentation/networking/NAPI_HOWTO.txt) [4] describes the NAPI interface in greater detail.

The net effect of NAPI is that under low loads it converges to an interrupt driven scheme (where "low load" gives the kernel enough time to process a packet before the next one arrives) and under high load the ratio of interrupts to packets is reduced. Results in [2] show how the scheme reduces latency and increases throughput (ratio of outgoing packets to incoming packets in a packet forwarder). Although we have not discussed this here, NAPI also prevents packet re-ordering, which was possible with the pre-NAPI (softnet) kernels.

Thus, NAPI limits the interrupt rate (this can be seen as adaptive interrupt coalescing) and prevents livelocks. However, since a device is handled by at most one CPU at a time, it does limit the parallelism possible with a single interface.

2.2.1 Handling a flood

Suppose there is a flood of incoming packets. NAPI's adaptive interrupt coalescing prevents a flood of interrupts. If packets are coming faster than the softirq can process them, i.e., the device's DMA ring is full, then the device silently drops the new packets.

So far so good. The kernel seems to be processing packets as fast as it can. However, remember that softirqs (and thus net_rx_action(), the driver's poll() function etc.) are invoked upon return from an interrupt handler. With this infrastructure, the kernel will just try to keep up with packet processing and user level code will never get to the CPU. Thus, user level processes, if any, will starve.

This is where a special kernel thread (ksoftirqd_CPUn) comes in, one thread for each CPU. Let's look at how softirqs come into being again. When interrupt handling finishes (the do_IRQ() function), do_softirq() is called. This function simply looks at pending softirqs and calls their handlers (NET_RX_SOFTIRQ is a softirq and net_rx_action() is its corresponding handler). The do_softirq() function ensures that it calls a particular softirq handler at most once. For example, after net_rx_action() returns, if do_softirq() notices that there is another NET_RX_SOFTIRQ pending then it does NOT call net_rx_action() again. Instead it wakes up a low priority thread - ksoftirqd. ksoftirqd runs in an infinite loop, checking if there is any pending softirq and if so executing its handler.
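
A rough sketch of the ksoftirqd main loop is shown below; the real code in kernel/softirq.c additionally deals with CPU binding, preemption points and hotplug. Being an ordinary low-priority thread, it only runs when nothing more important wants the CPU.

    #include <linux/sched.h>
    #include <linux/interrupt.h>

    static int ksoftirqd_sketch(void *unused)
    {
            set_current_state(TASK_INTERRUPTIBLE);
            for (;;) {
                    if (!local_softirq_pending())
                            schedule();          /* sleep until there is leftover work */
                    __set_current_state(TASK_RUNNING);
                    while (local_softirq_pending())
                            do_softirq();        /* runs net_rx_action() and friends */
                    set_current_state(TASK_INTERRUPTIBLE);
            }
            return 0;
    }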

The net_rx_action() handler processes at most netdev_max_backlog packets in one invocation. In kernels 2.4.23 and 2.6.8 for example, netdev_max_backlog is set to 300 by default. This value can be changed using /proc/sys/net/core/netdev_max_backlog. If there are more packets to be handled, then net_rx_action() schedules another NET_RX_SOFTIRQ. The softirq schedules itself.
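
The budget handling can be sketched as follows (modelled on net_rx_action() in the 2.6 net/core/dev.c, with 2.6-style per-CPU access; locking, the jiffies-based time limit and statistics are omitted).

    #include <linux/netdevice.h>
    #include <linux/interrupt.h>
    #include <linux/list.h>

    static void net_rx_action_sketch(struct softirq_action *h)
    {
            struct softnet_data *queue = &__get_cpu_var(softnet_data);
            int budget = netdev_max_backlog;     /* 300 by default */

            while (!list_empty(&queue->poll_list)) {
                    struct net_device *dev;

                    if (budget <= 0) {
                            /* Budget exhausted: schedule ourselves again; as
                             * described below, do_softirq() will hand the
                             * leftover work to ksoftirqd. */
                            __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                            return;
                    }
                    dev = list_entry(queue->poll_list.next,
                                     struct net_device, poll_list);
                    if (dev->poll(dev, &budget)) {
                            /* Device not drained: rotate it to the back of the
                             * list and give it a fresh per-device quota. */
                            list_move_tail(&dev->poll_list, &queue->poll_list);
                            dev->quota = dev->weight;
                    }
                    /* If poll() returned 0 the driver already removed itself
                     * from the poll list via netif_rx_complete(). */
            }
    }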

Thus, when net_rx_action() returns after handling 300 or so packets, do_softirq() notices that NET_RX_SOFTIRQ has been scheduled again. Instead of trying to process those packets it wakes up ksoftirqd. Now, the remaining packets will be processed when ksoftirqd gets CPU time.

ksoftirqd is a low priority thread, so if there are user processes waiting for CPU time, they will get it before ksoftirqd. This mechanism ensures that under heavy network traffic, interactive users of a system are not starved. On the other hand, if the system acts as a router then there should not be many user processes wanting CPU time and hence ksoftirqd runs all the time.

In short, under heavy traffic, all seems under control. The kernel tries to keep up with the incoming packet rate without starving user processes or being interrupted by the hardware too much.




3. Transmission

Packet transmission seems to be much simpler than reception. In this section we take a bottom-up approach: starting from the device, we move up to how the network layer manages queues of packets to be sent.

3.1 The device and its driver

DMA during transmission works as a "streaming" mapping instead of a "consistent" mapping (see Chapter 13 in [5]). In the case of the receive buffers, the driver can provide the device with a fixed set of addresses where it should place incoming packets, thus the DMA mapping is fixed for the lifetime of the driver (it is "consistent"). This works because the number of incoming packets is limited by the device. A system can however generate packets for transmission faster than the device can handle. Hence, when the device driver is given a packet to send, it first sets up a DMA mapping (mapping on a per-operation basis is called "streaming") and then instructs the hardware to start sending.

The transmission routine simply sets up the DMA mapping and some registers on the hardware and exits. It does not wait for the device to physically complete transmission. Concurrent accesses to the driver's transmission function (hard_start_xmit()) are prevented by a spinlock (xmit_lock).
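
A transmission routine along these lines might look like the sketch below. The mydev_priv structure is the one from the receive sketch in Section 2.1.1; mydev_post_tx_descriptor() and mydev_tx_ring_full() are invented helpers. Note that the caller already holds dev->xmit_lock, and that the function returns as soon as the hardware has been told where the packet is.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/pci.h>
    #include <linux/jiffies.h>

    static int mydev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            struct mydev_priv *priv = dev->priv;
            dma_addr_t mapping;

            /* "Streaming" DMA mapping, set up for this one packet only. */
            mapping = pci_map_single(priv->pdev, skb->data, skb->len,
                                     PCI_DMA_TODEVICE);

            /* Tell the hardware where the packet is and kick off transmission;
             * we return without waiting for it to finish. */
            mydev_post_tx_descriptor(dev, mapping, skb->len);
            dev->trans_start = jiffies;          /* consulted by the watchdog */

            /* No room for another packet?  Stop the queue so the kernel does
             * not call us again until the device catches up. */
            if (mydev_tx_ring_full(dev))
                    netif_stop_queue(dev);

            return 0;                            /* 0 == packet accepted */
    }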

The device can only handle a limited number of outstanding packets. Thus, calls to the transmission function are throttled - when the transmission ring is full, no more calls to the driver's transmission function are made. The "queue" of packets is "stopped" (netif_stop_queue()). When the device is ready for more, then the queue is woken up (netif_wake_queue()) and packets in the queue can be sent. The NET_TX_SOFTIRQ softirq is used to achieve this. When the packet transmission code sees that the device's queue is stopped, it schedules a NET_TX_SOFTIRQ. As in the case of the receive path, the ksoftirqd thread plays a role here.
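
The completion side of this, run from the device's interrupt handler once descriptors have actually gone out on the wire, could look like the following sketch; mydev_reap_tx_descriptor() is a hypothetical helper that yields the skb and DMA address of each completed slot.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/pci.h>

    static void mydev_tx_complete(struct net_device *dev)
    {
            struct mydev_priv *priv = dev->priv;
            struct sk_buff *skb;
            dma_addr_t mapping;

            while (mydev_reap_tx_descriptor(dev, &skb, &mapping)) {
                    /* Tear down the per-packet streaming mapping and free the
                     * buffer (we are in interrupt context). */
                    pci_unmap_single(priv->pdev, mapping, skb->len,
                                     PCI_DMA_TODEVICE);
                    dev_kfree_skb_irq(skb);
            }

            /* There is room in the ring again: let the queueing layer resume
             * calling hard_start_xmit() (internally this may raise
             * NET_TX_SOFTIRQ so that deferred packets get sent). */
            if (netif_queue_stopped(dev))
                    netif_wake_queue(dev);
    }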

The story so far is thus - the device driver provides a function which instructs the hardware to begin transmission. This function is protected from concurrent accesses by a spinlock. Under heavy load (when the hardware is unable to keep up), packet transmission may be deferred to the low priority ksoftirqd thread.

The device driver also has a timeout mechanism to react to occasional hardware faults (such as losing an interrupt). For this the driver provides a timeout callback function and the kernel implements a watchdog timer (dev_watchdog()). However, we will not concern ourselves with the timeouts at this time.

3.2 Queues and Disciplines

So far we have seen what happens when the device driver is instructed to send a packet. But how do packets get to this stage?

Let's consider a packet ready to send, present in the network (IP) layer. It could have reached here either because of higher layers (TCP/UDP) or it is a packet that was received and needs to be forwarded. Some higher level protocols such as UDP/ICMP are fairly simple and the packet reaches the IP layer (ip_queue_xmit()) almost "immediately", i.e., only a few function calls lie in between. TCP on the other hand is slightly more complicated with its congestion window, retransmission timers etc. So we ignore TCP for now and see virtually no latency/delay between the time the packet was created or a decision was made to forward it and the time the IP layer sees it. (NOTE: The exact sequence of function calls may be slightly tedious to trace, but looks something like this: ip_queue_xmit() --> ip_output() --> ip_finish_output() --> dev_queue_xmit()).

The link layer is where some complexity sets in because the device driver may not be invoked to send out the packet immediately, instead the packet may be queued for later transmission.

Each device maintains a queue of packets to be transmitted. How a packet is selected from this queue is known as the "queuing discipline" (and causes many variables and functions in the Linux kernel to be named "qdisc..."). The default qdisc is FIFO, but Linux allows other sophisticated schemes such as RED (Random Early Detection), CBQ and others.

When the link layer transmission function (dev_queue_xmit()) is called, it enqueues the packet according to the queuing discipline. It then dequeues a packet (the queueing discipline will select the "best" packet to send now) and invokes the device driver's transmit function (dev_queue_xmit() --> qdisc_run() --> qdisc_restart() --> dev->hard_start_xmit()). In case of any problems, such as the device being locked by a thread on another CPU (the xmit_lock spinlock) or the device not being in a position to accept more packets, a softirq (NET_TX_SOFTIRQ) is scheduled and packet transmission will happen at a later time.
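
A condensed sketch of this path (modelled on dev_queue_xmit() in net/core/dev.c and qdisc_restart() in net/sched/sch_generic.c, with the queue lock, reference counting and exact error codes omitted) is:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/spinlock.h>
    #include <net/pkt_sched.h>

    static int dev_queue_xmit_sketch(struct sk_buff *skb)
    {
            struct net_device *dev = skb->dev;
            struct Qdisc *q = dev->qdisc;

            q->enqueue(skb, q);                  /* qdisc decides the ordering */

            while ((skb = q->dequeue(q)) != NULL) {
                    if (!spin_trylock(&dev->xmit_lock)) {
                            /* Driver busy on another CPU: put the packet back
                             * and let NET_TX_SOFTIRQ retry later. */
                            q->ops->requeue(skb, q);
                            netif_schedule(dev);
                            return 0;
                    }
                    if (netif_queue_stopped(dev) ||
                        dev->hard_start_xmit(skb, dev) != 0) {
                            /* Device ring full (or driver refused the packet):
                             * same story, try again from the softirq. */
                            spin_unlock(&dev->xmit_lock);
                            q->ops->requeue(skb, q);
                            netif_schedule(dev);
                            return 0;
                    }
                    spin_unlock(&dev->xmit_lock);
            }
            return 0;
    }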

Packet transmission will be tried again when the softirq is run (ksoftirqd thread) or when there is an attempt to send another packet and the device is free at that time.

In short, packets to be sent get queued and are sent out "over the wire" by the ksoftirqd thread if immediate sending is not possible.




4. Conclusions

Based on the discussion so far, we now try to answer the questions posed at the beginning in Section 1 (Introduction).

  • How many threads are involved in the networking stack (IP and below)? - A per-CPU ksoftirqd thread and a per-interface watchdog timer thread. If the packet rate is very low, then the threads do not get involved: receive handling happens right after the interrupt handler returns and transmission happens immediately. If the rates are very high, then most receive processing and sends are handled by the ksoftirqd thread. The watchdog timer just restarts transmissions in case there were some hardware errors.
  • How long does a received packet spend in the system without being processed? - Assuming there are no other processes to run, processing of received packets is delayed only by interrupts and packet transmissions. With NAPI interrupt coalescing, receive interrupts are limited and occur only when the kernel is ready to receive and process more. Packet transmissions on the other hand may cause many interrupts. We will have to investigate further to find out how much interference these interrupts cause. Remember that packet transmissions are throttled depending on the size of the send buffer of the device.
  • How long does a packet to be sent wait before going out on the interface? - Packet transmissions are throttled based on the number of outstanding transmissions the device can handle. Thus, packet transmissions are limited by the device. When the device is ready to send more packets it will generate an interrupt and the next packet to be sent will at worst have to wait for some received packets to be processed (netdev_max_backlog packets at most). This is because packet transmissions and receive processing are both done by a softirq and two softirqs cannot run on the same CPU at the same time.



5. More Questions

Some questions on the "TODO" list:

  • NAPI can save on tons of receive interrupts. But for transmission, there is still one interrupt per successful send? If so, in a forwarder, the number and rate of incoming and outgoing packets should be nearly equal, and if livelocks were caused by per-packet receive interrupts then don't per-packet transmission-related interrupts have an adverse effect as well?
  • Some more details on the transmission watchdog timer. How frequently do retransmissions need to take place?



6. References

[1] Rio, M., Goutelle, M., Kelly, T., Hughes-Jones, R., Martin-Flatin, J. and Y. Li, "A Map of the Networking Code in Linux Kernel 2.4.20", March 2004.
[2] Salim, J., Olsson, R. and A. Kuznetsov, "Beyond Softnet", November 2001.
[3] Bovet, D. and M. Cesati, "Understanding the Linux Kernel", 2nd Edition, O'Reilly & Associates, ISBN 81-7366-589-3.
[4] "Linux kernel source: Documentation/networking/NAPI_HOWTO.txt".
[5] Rubini, A. and J. Corbet, "Linux Device Drivers", 2nd Edition, O'Reilly & Associates, June 2001, ISBN 0-59600-008-1.



Author's Address

  Asim Shankar
EMail:  shankar@uiuc.edu