Rough Notes on Networking in the Linux Kernel
Last Updated: Monday, 28 March, 2005
1. Existing Optimizations
A great deal of thought has gone into the Linux networking implementation and many
optimizations have made their way into the kernel over the years. Some prime
examples include:
NAPI - Receive interrupts are coalesced to reduce the chance of a
receive livelock, so not every received packet generates an interrupt.
This required modifications to the device driver interface. It has been in the
stable kernels since 2.4.20 (a poll() sketch follows this list).
Zero-Copy TCP - Avoids the overhead of kernel-to-userspace and
userspace-to-kernel packet copying. http://builder.com.com/5100-6372-1044112.html describes this in
some detail.
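Since the NAPI item above only states the idea, here is a hedged sketch of what a 2.6-era NAPI poll() callback looks like (the old dev->poll interface with a *budget argument, as used around 2.6.9). The demo_* names and the adapter state are made up for illustration; only the quota/budget bookkeeping and the netif_rx_complete() call reflect the real interface.

    #include <linux/kernel.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    struct demo_adapter;                            /* hypothetical driver state */

    /* Stand-ins for real driver routines: walk the RX descriptor ring, build an
     * skb per completed descriptor and hand each one up via netif_receive_skb(). */
    extern int demo_clean_rx_ring(struct demo_adapter *ad, int *work_done, int work_to_do);
    extern void demo_enable_rx_irq(struct demo_adapter *ad);

    static int demo_poll(struct net_device *dev, int *budget)
    {
            struct demo_adapter *adapter = dev->priv;
            int work_to_do = min(*budget, dev->quota);
            int work_done = 0;

            demo_clean_rx_ring(adapter, &work_done, work_to_do);

            *budget -= work_done;                   /* account against softirq budget */
            dev->quota -= work_done;

            if (work_done < work_to_do) {           /* ring drained for now */
                    netif_rx_complete(dev);         /* leave the poll list */
                    demo_enable_rx_irq(adapter);    /* interrupts back on */
                    return 0;
            }
            return 1;                               /* more work: stay in polling mode */
    }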
2. Socket Buffers (skbs) and Packet Memory
When a packet is received, the device uses DMA to put it in main memory (let's
ignore non-DMA and non-NAPI code and drivers). An skb is
constructed by the poll() function of the device driver.
After this point, the same skb is used
throughout the networking stack, i.e., the packet is almost never copied within
the kernel (it is copied only when delivered to user-space).
This design is borrowed from BSD and UNIX SVR4 - the idea is to allocate memory
for the packet only once. An skb has 4 primary pointers into the packet data
(character buffer): head, end, data and tail.
head points to the beginning of the packet - where the link
layer header starts. end points to the end of the packet.
data points to the location the current networking layer can
start reading from (i.e., it changes as the packet moves up from the link
layer, to IP, to TCP). Finally, tail is where the current
protocol layer can begin writing data (see alloc_skb(),
which sets head, data and tail to the beginning of the allocated
memory block and end to data + size).
Other implementations refer to head, end, data, tail as
base, limit, read, write respectively.
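A minimal sketch (not from the original notes) of how these pointers move when an skb is built with the standard helpers; skb_reserve(), skb_put(), skb_push() and skb_pull() only adjust data, tail and the length, never head or end.

    #include <linux/skbuff.h>
    #include <linux/if_ether.h>
    #include <linux/string.h>

    /* Right after alloc_skb(): head == data == tail, end == data + size. */
    static struct sk_buff *demo_build_skb(const void *payload, unsigned int len)
    {
            struct sk_buff *skb = alloc_skb(len + ETH_HLEN + 2, GFP_ATOMIC);
            if (!skb)
                    return NULL;

            skb_reserve(skb, ETH_HLEN + 2);          /* data += 16, tail += 16: headroom */
            memcpy(skb_put(skb, len), payload, len); /* tail += len: append the payload */
            skb_push(skb, ETH_HLEN);                 /* data -= 14: prepend link header */
            /* On receive the layers do the opposite: skb_pull(skb, ETH_HLEN)
             * advances data past the Ethernet header before IP looks at it. */
            return skb;
    }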
There are some instances where a packet needs to be duplicated. For example,
when running tcpdump the packet needs to be sent to the
userspace process as well as to the normal IP handler. Actually, in this case
too, a copy can be avoided since the contents of the packet are not being
modified. So instead of duplicating the packet contents,
skb_clone() is used to increase the reference count on the
packet data. skb_copy() on the other hand actually duplicates
the contents of the packet and creates a completely new skb.
See also: http://oss.sgi.com/archives/netdev/2005-02/msg00125.html
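A small hedged sketch of the difference (the demo function is made up; the calls are the standard ones): a clone shares the packet data and only duplicates the struct sk_buff with its pointers, while a copy duplicates the data as well.

    #include <linux/skbuff.h>
    #include <linux/ip.h>

    static void demo_clone_vs_copy(struct sk_buff *skb)
    {
            struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC); /* shares skb->data */
            struct sk_buff *copy  = skb_copy(skb, GFP_ATOMIC);  /* private data copy */

            if (clone) {
                    skb_pull(clone, sizeof(struct iphdr)); /* moves only clone->data */
                    kfree_skb(clone);                      /* drops one data reference */
            }
            if (copy)
                    kfree_skb(copy);                       /* frees the duplicated data */
    }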
A related question: When a packet is received, are the tail
and end pointers equal? Answer: NO. This is because memory
for received packets is allocated before the packet
arrives, and the address and size of this memory are communicated to the
NIC using receive descriptors - so that when the packet actually arrives the NIC
can use DMA to transfer it to main memory. The size allocated for a
received packet is a function of the MTU of the device, while the Ethernet
frame actually received could be anything up to the MTU. Thus, tail
of a received packet will point to the end of the received data while end
will point to the end of the memory allocated for the packet.
3. ICMP Ping/Pong : Function Calls
Code path (functions called) when an ICMP ping is received (and the corresponding
pong goes out), for Linux 2.6.9: First the packet is received by the NIC and
its interrupt handler ultimately causes net_rx_action()
to be called (NAPI, [1]). This will call the device
driver's poll function, which submits packets
(skb's) to the networking stack via
netif_receive_skb(). The rest is outlined below:
ip_rcv() --> ip_rcv_finish()
dst_input() --> skb->dst->input = ip_local_deliver()
ip_local_deliver() --> ip_local_deliver_finish()
ipprot->handler = icmp_rcv()
icmp_pointers[ICMP_ECHO].handler == icmp_echo() -- At this point I guess you
could say that the "receive" path is complete, the packet has reached the top.
Now the outbound (down the stack) journey begins.
icmp_reply() -- Might want to look into the checks this function does
icmp_push_reply()
ip_push_pending_frames()
dst_output() --> skb->dst->output = ip_output()
ip_output() --> ip_finish_output() --> ip_finish_output2()
dst->neighbour->output ==
4. Transmit Interrupts and Flow Control
Transmit interrupts are generated after every packet transmission and this is
key to flow control. However, this does have significant performance
implications under heavy transmit-related I/O (imagine a packet forwarder where
the number of transmitted packets is equal to the number of received ones).
Each device provides a means to slow down transmit (Tx) interrupts. For
example, Intel's e1000 driver exposes a "TxIntDelay" parameter that allows transmit
interrupts to be delayed in units of 1.024 microseconds. The default value is
64; thus, under heavy transmission, interrupts are spaced 65.536
microseconds apart. Imagine the number of transmissions that can take place in
this time.
5. NIC driver callbacks and ifconfig
Interfaces are configured using the ifconfig command. Many
of these commands will result in a function of the NIC driver being called. For
example, ifconfig eth0 up should result in the device
driver's open() function being called
(open is a member of struct net_device).
ifconfig communicates with the kernel through
ioctl() on any socket. The requests are passed as
a struct ifreq (see
/usr/include/net/if.h and http://linux.about.com/library/cmd/blcmdl7_netdevice.htm). Thus, ifconfig eth0 up will result in the following
(a userspace sketch follows the list):
A socket (of any kind) is opened using socket()
A struct ifreq is prepared with ifr_name set to "eth0"
An ioctl() with request
SIOCGIFFLAGS is done to get the current flags and
then the IFF_UP and IFF_RUNNING
flags are set with another ioctl() (with request
SIOCSIFFLAGS).
Now we're inside the kernel. sock_ioctl() is called,
which in turn calls dev_ioctl() (see
net/socket.c and net/core/dev.c)
dev_ioctl() --> ... -->
dev_open() --> driver's open()
implementation.
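Here is a hedged userspace sketch of the steps above (roughly what ifconfig eth0 up does before entering the kernel); error handling is minimal and the interface name is hard-coded for illustration.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>

    int main(void)
    {
            struct ifreq ifr;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);        /* any socket will do */
            if (fd < 0) { perror("socket"); return 1; }

            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

            if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) {        /* read current flags */
                    perror("SIOCGIFFLAGS"); close(fd); return 1;
            }
            ifr.ifr_flags |= IFF_UP | IFF_RUNNING;
            if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) {        /* write flags back */
                    perror("SIOCSIFFLAGS"); close(fd); return 1;
            }
            close(fd);
            return 0;
    }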
6. Protocol Structures in the Kernel
There are various structs in the kernel which consist of
function pointers for protocol handling. Different structures correspond to
different layers of protocols as well as whether the functions are for
synchronous handling (e.g., when recv(), send() etc.
system calls are made) or asynchronous handling (e.g., when a packet
arrives at the interface and it needs to be handled). Here is what I have
gathered about the various structures so far:
struct packet_type - includes instantiations such as
ip_packet_type, ipv6_packet_type etc. These provide
low-level, asynchronous packet handling. When a packet arrives at the
interface, the driver ultimately submits it to the networking stack by
a call to netif_receive_skb(), which iterates over the
list of registered packet handlers and submits the
skb to them. For example,
ip_packet_type.func = ip_rcv, so
ip_rcv() is where one can say the IP protocol first
receives a packet that has arrived at the interface. Packet types are
registered with the networking stack by a call to
dev_add_pack() (a registration sketch follows this list).
struct net_proto_family - includes instantiations
such as inet_family_ops, packet_family_ops etc. Each
net_proto_family structure handles one type of
address family (PF_INET etc.). This structure is
associated with a BSD socket (struct socket) and not
the networking layer representation of sockets (struct
sock). It essentially provides a create()
function which is called in response to the socket()
system call. The implementation of create() for each
family typically allocates the struct sock and also
associates other synchronous operations (see struct
proto_ops below) with the socket. To cut a long story short -
net_proto_family provides the protocol-specific part
of the socket() system call. (NOTE: Not all BSD
sockets will have a networking socket associated with them; for example,
for unix sockets (the PF_UNIX address family),
unix_family_ops.create = unix_create does not
allocate a struct sock.) The
net_proto_family structure is registered with the
networking stack by a call to sock_register().
struct proto_ops - includes instantiations such as
inet_stream_ops, inet_dgram_ops, packet_ops etc.
These provide implementations of networking layer synchronous calls
(connect(), bind(), recvmsg(), ioctl() etc. system
calls). The ops member of the BSD socket structure
(struct socket) points to the
proto_ops associated with the socket. Unlike the
above two structures, there is no function that explicitly registers a
struct proto_ops with the networking stack. Instead,
the create() implementation of struct
net_proto_family just sets the ops field
of the BSD socket to the appropriate proto_ops
structure.
struct proto - includes instantiations such as
tcp_prot, udp_prot, raw_prot. These provide protocol
handlers inside a network family. It seems that currently this means
only over-IP protocols, as I could find only the above three
instantiations. These also provide implementations for synchronous
calls. The sk_prot field of the networking socket
(struct sock) points to such a structure. The
sk_prot field gets set by the
create function in struct
net_proto_family, and the functions provided will be called by
the implementations of functions in the struct
proto_ops structure. For example,
inet_family_ops.create = inet_create allocates a
struct sock and would set sk_prot =
udp_prot in response to a socket(PF_INET, SOCK_DGRAM,
0); system call. A recvfrom() system call
made on the socket would then invoke inet_dgram_ops.recvmsg =
sock_common_recvmsg, which calls sk_prot->recvmsg =
udp_recvmsg. Like proto_ops,
struct proto instances aren't explicitly "registered" with
the networking stack using a function, but are "registered" by the BSD
socket create() implementation in the
struct net_proto_family.
struct net_protocol - includes instantiations such
as tcp_protocol, udp_protocol, icmp_protocol etc.
These provide asynchronous packet receive routines for IP protocols.
Thus, this structure is specific to the inet-family of protocols.
Handlers are registered using inet_add_protocol().
This structure is used by the IP-layer routines to hand off to a layer
4 protocol. Specifically, the IP handler (ip_rcv())
will invoke ip_local_deliver_finish() for packets
that are to be delivered to the local host.
ip_local_deliver_finish() uses a hash table
(inet_protos) to decide which function to pass the
packet to based on the protocol field in the IP header. The hash table
is populated by the call to inet_add_protocol().
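As promised above, a hedged sketch of registering a struct packet_type (the asynchronous dev_add_pack() path). The demo_* names are made up; the field layout and the three-argument func signature match 2.6-era kernels, and ip_packet_type is wired up the same way with .func = ip_rcv. A struct net_protocol handler would analogously be registered with inet_add_protocol().

    #include <linux/module.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/if_ether.h>

    /* Called from netif_receive_skb() for every IPv4 frame; each registered
     * handler gets its own reference to the skb and must release it. */
    static int demo_rcv(struct sk_buff *skb, struct net_device *dev,
                        struct packet_type *pt)
    {
            kfree_skb(skb);         /* just drop it; a real handler would process it */
            return 0;
    }

    static struct packet_type demo_pt = {
            .type = __constant_htons(ETH_P_IP),  /* same protocol id as ip_packet_type */
            .func = demo_rcv,
    };

    static int __init demo_init(void)
    {
            dev_add_pack(&demo_pt);
            return 0;
    }

    static void __exit demo_exit(void)
    {
            dev_remove_pack(&demo_pt);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");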
7. skb_clone() vs. skb_copy()
When a packet needs to be delivered to two separate handlers (for
example, the IP layer and tcpdump), then it is "cloned" by
incrementing the reference count of the packet instead of being "copied". Now,
though the two handlers are not expected to modify the packet contents, they
can change the data pointer. So, how do we ensure that
processing by one of the handlers doesn't mess up the data
pointer for the other? A. Umm... skb_clone means that
there are separate head, tail, data, end etc. pointers. The
difference between skb_copy() and
skb_clone() is precisely this - the former copies the packet
completely, while the latter uses the same packet data but separate pointers
into the packet.
8. NICs and Descriptor Rings
NOTE: Using the Intel e1000 (driver source version 5.6.10.1) as an example.
Each transmission/reception has a descriptor - a "handle" used to access buffer
data somewhat like a file descriptor is a handle to access file data. The
descriptor format would be NIC dependent as the hardware understands and
reads/writes to the descriptor. The NIC maintains a circular ring of
descriptors, i.e., the number of descriptors for TX and RX is fixed
(TxDescriptors, RxDescriptors module parameters for the
e1000 kernel module) and the descriptors are used like a
circular queue.
Thus, there are three structures:
Descriptor Ring (struct e1000_desc_ring) - The
list of descriptors. So, ring[0], ring[1] etc. are individual
descriptors. The ring is typically allocated just once and thus
the DMA mapping of the ring is "consistent". Each descriptor in the
ring will thus have a fixed DMA and memory address. In the e1000,
the device registers TDBAL, TDBAH and TDLEN hold the "Transmit
Descriptors Base Address Low", "High" and "Length" (in bytes, of all
descriptors). Similarly, there are RDBAL, RDBAH and RDLEN for the receive side.
Descriptors (struct e1000_rx_desc and
struct e1000_tx_desc) - Essentially, this stores the
DMA address of the buffer which contains actual packet data, plus some
other accounting information such as the status (transmission
successful? receive complete? etc.), errors etc.
Buffers - Now actual data cannot have a "consistent" DMA mapping, meaning we
cannot ensure that all skbuffs for a particular device always have some
specific memory addresses (those that have been set up for DMA). Instead,
"streaming" DMA mappings need to be used. Each descriptor thus contains
the DMA address of a buffer that has been set up for streaming mapping.
The hardware uses that DMA address to pick up a packet to be sent
or to place a received packet. Once the kernel's stack picks up the
buffer, it can allocate new resources (a new buffer) and tell the NIC
to use that buffer next time by setting up a new streaming mapping
and putting the new DMA handle in the descriptor (a sketch of this
refill pattern follows the list).
The e1000 uses a struct e1000_buffer as a wrapper
around the actual buffer. The DMA mapping however is set up only for
skb->data, i.e., where raw packet data is to be
placed.
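The refill pattern mentioned above, as a hedged sketch. The descriptor layout and the demo_* names are illustrative only (they do not match the real e1000 structures); the point is simply that each RX descriptor carries the streaming DMA handle of a freshly allocated skb->data buffer.

    #include <linux/pci.h>
    #include <linux/skbuff.h>
    #include <linux/types.h>

    struct demo_rx_desc {                   /* hypothetical, NIC-defined layout */
            u64 buffer_addr;                /* DMA address the NIC writes the frame to */
            u16 length;
            u16 status;
    };

    static void demo_refill_one(struct pci_dev *pdev, struct demo_rx_desc *desc,
                                struct sk_buff **buffer, unsigned int buf_len)
    {
            struct sk_buff *skb = dev_alloc_skb(buf_len);   /* sized from the MTU */
            if (!skb)
                    return;                 /* retry on the next refill pass */

            *buffer = skb;                  /* remembered so the driver can unmap/free it later */
            /* Streaming DMA mapping of skb->data only, as described above. */
            desc->buffer_addr = cpu_to_le64(pci_map_single(pdev, skb->data, buf_len,
                                                           PCI_DMA_FROMDEVICE));
            desc->status = 0;               /* NIC sets "descriptor done" on receive */
    }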
9. How much networking work does the ksoftirqd do?
Consider what the NET_RX_SOFTIRQ does:
Each softirq invocation (do_softirq()) processes
up to net.core.netdev_max_backlog x MAX_SOFTIRQ_RESTART
packets, if available. The default values lead to 300 x 10 = 3000 pkts.
Every interrupt calls do_softirq()
when exiting (irq_exit()) - including the timer interrupt
and NMIs too?
Default transmit/receive ring sizes on the NIC are less than 3000 (the e1000
for example defaults to 256 and can have at most 4096 descriptors on its
ring)
Thus, the number of times ksoftirqd will be switched in/out
depends on how much processing is done by do_softirq()
invocations on irq_exit(). If the softirq handling on
interrupt is able to clean up the NIC ring faster than new packets come in,
then ksoftirqd won't be doing anything. Specifically, if the
inter-packet gap is greater than the time it takes to pick up and process a
single packet from the NIC (and the number of descriptors on the NIC is less
than 3000), then ksoftirqd will not be scheduled.
Without going into details, some quick experimental verification: Machine A
continuously generates UDP packets for Machine B, which is running a "sink"
application, i.e., it just loops on a recvfrom(). When the
size of the packet sent from A was 60 bytes (and the inter-packet gap averaged
1.5µs), the ksoftirqd thread on B observed a total of
375 context switches (374 involuntary and 1 voluntary). When the packet size
was 1280 bytes (and the inter-packet gap increased almost 7 times, to 10µs),
the ksoftirqd thread was NEVER scheduled (0 context
switches). The single voluntary context switch in the former case probably
happened after all packets were processed (i.e., the sender stopped sending and
the receiver processed all that it got).
10. Packet Requeues in Qdiscs
The queueing discipline (struct Qdisc) provides a
requeue() function. Typically, packets are dequeued from the
qdisc and submitted to the device driver (the hard_start_xmit
function in struct net_device). However, at times it
is possible that the device driver is "busy", so the dequeued packet
must be "requeued". "Busy" here means that the xmit_lock
of the device was held. It seems that this lock is acquired in two places:
(1) qdisc_restart() and (2) dev_watchdog().
The former handles packet dequeueing from the qdisc, acquiring the xmit_lock
and then submitting the packet to the device driver (hard_start_xmit()), or
alternatively requeuing the packet if the xmit_lock was already held by
someone else (a simplified sketch follows).
The latter is invoked asynchronously and periodically - it's part of the watchdog timer mechanism.
My understanding is that two threads cannot be in qdisc_restart()
for the same qdisc at the same time; however, the xmit_lock may have been
acquired by the watchdog timer function, causing a requeue.
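A greatly simplified, hedged sketch of the control flow just described. This condenses what qdisc_restart() does in a 2.6-era kernel; it is not the actual kernel source, and details such as dev->queue_lock handling, the xmit_lock owner tracking and netif_queue_stopped() checks are omitted.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <net/pkt_sched.h>

    static int demo_qdisc_restart(struct net_device *dev)
    {
            struct Qdisc *q = dev->qdisc;
            struct sk_buff *skb = q->dequeue(q);

            if (!skb)
                    return 0;                               /* nothing queued */

            if (spin_trylock(&dev->xmit_lock)) {            /* driver not busy */
                    int ret = dev->hard_start_xmit(skb, dev);
                    spin_unlock(&dev->xmit_lock);
                    if (ret == 0)
                            return 1;                       /* packet handed to the NIC */
                    /* driver refused it (e.g. ring full): fall through and requeue */
            }

            /* xmit_lock held elsewhere (e.g. by dev_watchdog()) or xmit failed:
             * put the packet back and ask for another run later. */
            q->ops->requeue(skb, q);
            netif_schedule(dev);
            return 0;
    }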
This is just a dump of links that might be useful.
[3] Beyond Softnet. Jamal Hadi Salim, Robert Olsson, and Alexey Kuznetsov.
USENIX, Nov 2001.
[5] A Map of the Networking Code in Linux Kernel 2.4.20. Miguel Rio,
Mathieu Goutelle, Tom Kelly, Richard Hughes-Jones, Jean-Philippe
Martin-Flatin, and Yee-Ting Li. Mar 2004.
[4] Understanding the Linux Kernel. Daniel P. Bovet and Marco Cesati.
O'Reilly & Associates, 2nd Edition. ISBN 81-7366-589-3.