%include "default.mgp"
%default 1 bgrad
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%nodefault
%back "blue"

%center
%size 7
A tour of the Linux 2.6 network stack

%center
%size 4
by

Harald Welte

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Contents

	Introduction
	Hardirq Context
	Hard Interrupt Handler
	Softirq Context
	Network RX Softirq
	IPv4 Packet Handler
	IPv4 Packet Forwarding
	IPv4 Packet Output
	Driver TX routine

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Introduction

	Who is speaking to you?
		an independent Free Software developer
		who has earned his living from Free Software since 1997
		who is one of the authors of the Linux kernel firewall system called netfilter/iptables

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Interrupt context

	Also called 'hardirq'
	Triggered by external interrupt to the CPU
	Is not reentrant, because the irq is disabled before the handler is called
	Should only do a minimum of work and leave as fast as possible
	hardirq handler registered via request_irq()

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Receive Interrupt

	NIC receives packet for local MAC address
	NIC issues interrupt
	Interrupt is routed to one CPU
	Kernel enters hardirq context and disables this irq on local CPU
	Driver's interrupt handler
		allocates skb (struct sk_buff)
		calls net/core/dev.c:netif_rx()
		returns irqreturn_t
	Kernel leaves hardirq context and reenables this irq
	2.6.x introduces NAPI for polling at high irq rates: netif_rx_schedule()

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Softirq context

	Softirq is the real workhorse of interrupts
	Continues work where hardirq has finished
	Can be interrupted by hardirq context
	Can run in parallel on any number of CPUs
	softirq handler registered via kernel/softirq.c:open_softirq()
	softirqs need to be 'raised' by raise_softirq() from hardirq
	softirqs are scheduled
		after hardirq context exits
		from ksoftirqd in case there's too much work

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Network RX Softirq

	kernel/softirq.c:do_softirq()
		generic softirq code
	net/core/dev.c:net_rx_action()
		function that is registered at open_softirq() time
	net/core/dev.c:process_backlog()
		dequeue skb from local CPU's backlog queue
		uses a weighting scheme between different devices

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
netif_receive_skb()

	net/core/dev.c:netif_receive_skb()
		main network rx softirq workhorse
		check if there are any netpoll users, if yes netpoll_rx()
		if somebody requested skb rx timestamp, net_timestamp()
		if interface is part of a bonding group, skb_bond()
		tc ingress filtering: ing_filter()
		packet diverter: handle_diverter()
		bridging handler: net/core/dev.c:handle_bridge()
		deliver to l3 protocol handler: net/core/dev.c:deliver_skb()

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
IPv4 packet handler

	net/ipv4/ip_input.c:ip_rcv()
		checksum check
		size check
		NF_IP_PRE_ROUTING netfilter hook
	net/ipv4/ip_input.c:ip_rcv_finish()
		net/ipv4/route.c:ip_route_input()
			route/dst cache lookup
			if lookup fails, ip_route_input_slow()
				fib lookup
				allocation of new dst_entry / rtable
		include/net/dst.h:dst_input()
			iterate over destination stack
			call destination function of the respective stack items

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
IPv4 packet forwarding

	net/ipv4/ip_forward.c:ip_forward()
		xfrm4_policy_check()
		router alert handling (ip_call_ra_chain)
		ttl decrement
		if route is redirect route, ip_rt_send_redirect()
		call NF_IP_FORWARD netfilter hook
	net/ipv4/ip_forward.c:ip_forward_finish()
		increase statistics for snmp mib
		include/net/dst.h:dst_output()
			iterate over output functions of dst stack

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
IPv4 packet output

	net/ipv4/ip_output.c:ip_output()
		fragment packet via ip_fragment() if needed
	net/ipv4/ip_output.c:ip_finish_output()
		call netfilter NF_IP_POST_ROUTING hook
	net/ipv4/ip_output.c:ip_finish_output2()
		attach hardware header
		call header cache output fn (if neighbour in cache)
			net/core/dev.c:dev_queue_xmit()
		or neighbour output function (if neighbour unknown)
			net/core/neighbour.c:neigh_resolve_output()

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
dev_queue_xmit()

	skb->dev->qdisc->enqueue()
		enqueue into device's output queue
		default: net/sched/sch_generic.c:pfifo_fast_enqueue()
	net/sched/sch_generic.c:qdisc_restart():
		dev->qdisc->dequeue()
			dequeue skb from queue
		dev->hard_start_xmit()
			transmit skb via driver

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Driver TX Routine

	drivers/net/e1000/e1000_main.c:e1000_xmit_frame()
		tons of workarounds for chip bugs
		set up TX DMA descriptor
		queue TX DMA descriptor to device hardware
		return NETDEV_TX_OK

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Thanks

	Thanks to
		Alan Cox, Alexey Kuznetsov, David Miller, Andi Kleen
			for implementing (one of?) the world's best TCP/IP stacks
		Paul 'Rusty' Russell
			for starting the netfilter/iptables project
			for trusting me to maintain it today
		Astaro AG
			for sponsoring parts of my netfilter work
		Free Software Foundation
			for the GNU Project
			for the GNU General Public License
%size 3
	The slides of this presentation are available at http://www.gnumonks.org/

	Further Reading
%size 3
		The netfilter homepage http://www.netfilter.org/
%size 3
		The http://www.gpl-violations.org/ project
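
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Linux 2.6 Network Tour
Appendix: hardirq/softirq split (sketch)

	The receive path described in this tour splits work in two: the hardirq side only queues the packet and raises the softirq, and the softirq side later drains the per-CPU backlog.
	The following is a minimal userspace model of that split, not kernel code. Every name in it (sim_cpu, sim_skb, sim_netif_rx, sim_net_rx_action, BACKLOG_MAX) is hypothetical, and the fixed budget is only an assumed stand-in for the weighting scheme mentioned for process_backlog().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel APIs. */
#define BACKLOG_MAX 4

struct sim_skb {
    int id;                               /* stands in for packet data */
};

struct sim_cpu {
    struct sim_skb backlog[BACKLOG_MAX];  /* per-CPU backlog (ring buffer) */
    size_t head;                          /* index of the oldest queued skb */
    size_t count;                         /* number of queued skbs */
    bool rx_softirq_pending;              /* models a raised RX softirq */
    int delivered;                        /* packets handed to "layer 3" */
    int dropped;                          /* packets dropped, backlog full */
    int last_id;                          /* id of the last delivered packet */
};

/* "hardirq" side: queue the packet, raise the softirq, do nothing else. */
static bool sim_netif_rx(struct sim_cpu *cpu, struct sim_skb skb)
{
    if (cpu->count == BACKLOG_MAX) {
        cpu->dropped++;                   /* backlog full: drop the packet */
        return false;
    }
    cpu->backlog[(cpu->head + cpu->count) % BACKLOG_MAX] = skb;
    cpu->count++;
    cpu->rx_softirq_pending = true;       /* ask for softirq processing */
    return true;
}

/* "softirq" side: drain at most `budget` packets from the backlog. */
static int sim_net_rx_action(struct sim_cpu *cpu, int budget)
{
    int done = 0;

    if (!cpu->rx_softirq_pending)
        return 0;                         /* nothing was raised */

    while (cpu->count > 0 && done < budget) {
        struct sim_skb skb = cpu->backlog[cpu->head];

        cpu->head = (cpu->head + 1) % BACKLOG_MAX;
        cpu->count--;
        cpu->last_id = skb.id;            /* "deliver" to the protocol layer */
        cpu->delivered++;
        done++;
    }
    assert(cpu->count <= BACKLOG_MAX);

    /* Stay pending if work is left, so a later pass picks it up. */
    cpu->rx_softirq_pending = (cpu->count > 0);
    return done;
}
```

	Calling sim_netif_rx() five times on a zero-initialised sim_cpu queues four packets and drops the fifth; two calls to sim_net_rx_action() with a budget of 3 then deliver three packets and one packet. The point of the design is that the interrupt side stays short while the bulk of the work is deferred.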