\documentclass{article} \usepackage{german} \usepackage{fancyheadings} \usepackage{a4} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \setlength{\topmargin}{0.0in} \setlength{\headheight}{0in} \setlength{\headsep}{0in} \setlength{\textwidth}{6.5in} \setlength{\textheight}{9.5in} \setlength{\parindent}{0in} \setlength{\parskip}{0.05in}

\begin{document}

\title{Linux 2.4.x netfilter/iptables firewalling internals} \author{Harald Welte\\ laforge@gnumonks.org\\ \copyright{}2002 H. Welte} \date{25. April 2002} \maketitle

\setcounter{section}{0} \setcounter{subsection}{0} \setcounter{subsubsection}{0}

\section{Introduction}

The Linux 2.4.x kernel series has introduced a totally new kernel firewalling subsystem. It is much more than a plain successor of ipfwadm or ipchains. The netfilter/iptables project has a very modular design, and its sub-projects can be split into several parts: netfilter, iptables, connection tracking, NAT and packet mangling. While most users will already have learned how to use the basic functions of netfilter/iptables in order to convert their old ipchains firewalls to iptables, there is more advanced but less-used functionality in netfilter/iptables.

The presentation covers the design principles behind the netfilter/iptables implementation. This knowledge enables us to understand how the individual parts of netfilter/iptables fit together, and for which potential applications this is useful.

\section{Internal netfilter/iptables architecture}

\subsection{Netfilter hooks in protocol stacks}

One of the major motivations behind the redesign of the Linux packet filtering and NAT system during the 2.3.x kernel series was the widespread firewall-specific code within the core IPv4 stack. Ideally the core IPv4 stack (as used by regular hosts and routers) shouldn't contain any firewalling-specific code, resulting in no unwanted interaction and less code complexity. This desire led to the invention of {\it netfilter}.
\subsubsection{Architecture of netfilter}

Netfilter is basically a system of callback functions within the network stack. It provides a non-portable API towards in-kernel networking extensions. What we call a {\it netfilter hook} is a well-defined call-out point within a layer three protocol stack, such as IPv4, IPv6 or DECnet. Any layer three network stack can define an arbitrary number of hooks, usually placed at strategic points within the packet flow. Any other kernel code can subsequently register callback functions for any of these hooks. As in most systems there will be more than one callback function registered for a particular hook, a {\it priority} is specified upon registration of the callback function. This priority defines the order in which the individual callback functions at a particular hook are called. The return value of any registered callback function can be:
\begin{itemize} \item {\bf NF\_ACCEPT}: continue traversal as usual \item {\bf NF\_DROP}: drop the packet; do not continue traversal \item {\bf NF\_STOLEN}: the callback function has taken over the packet; do not continue \item {\bf NF\_QUEUE}: enqueue the packet to userspace \item {\bf NF\_REPEAT}: call this hook again \end{itemize}

\subsubsection{Netfilter hooks within IPv4}

The IPv4 stack provides five netfilter hooks, which are placed at the following points within the code:
\begin{verbatim}
    --->[1]--->[ROUTE]--->[3]--->[4]--->
          |                 ^
          |                 |
          |              [ROUTE]
          v                 |
         [2]               [5]
          |                 ^
          |                 |
          v                 |
          local processes
\end{verbatim}
Packets received on any network interface arrive at the left side of the diagram. After the verification of the IP header checksum, the NF\_IP\_PRE\_ROUTING [1] hook is traversed. If the packet ``survives'' (i.e. NF\_ACCEPT is returned by all callback functions), it enters the routing code. Where we continue from here depends on the destination of the packet. Packets with a local destination (i.e.
packets where the destination address is one of the host's own IP addresses) traverse the NF\_IP\_LOCAL\_IN [2] hook. If all callback functions return NF\_ACCEPT, the packet is finally passed to the socket code, which eventually passes it to a local process. Packets with a remote destination (i.e. packets which are forwarded by the local machine) traverse the NF\_IP\_FORWARD [3] hook. If they ``survive'', they finally pass the NF\_IP\_POST\_ROUTING [4] hook and are sent out on the outgoing network interface. Locally generated packets first traverse the NF\_IP\_LOCAL\_OUT [5] hook, then enter the routing code, and finally go through the NF\_IP\_POST\_ROUTING [4] hook before being sent out on the outgoing network interface.

\subsubsection{Netfilter hooks within IPv6}

As the IPv4 and IPv6 protocols are very similar, the netfilter hooks within the IPv6 stack are placed at exactly the same locations as in the IPv4 stack. The only difference is the hook names: NF\_IP6\_PRE\_ROUTING, NF\_IP6\_LOCAL\_IN, NF\_IP6\_FORWARD, NF\_IP6\_POST\_ROUTING, NF\_IP6\_LOCAL\_OUT.

\subsubsection{Netfilter hooks within DECnet}

There are seven DECnet hooks. The first five hooks (NF\_DN\_PRE\_ROUTING, NF\_DN\_LOCAL\_IN, NF\_DN\_FORWARD, NF\_DN\_LOCAL\_OUT, NF\_DN\_POST\_ROUTING) are pretty much the same as in IPv4. The last two hooks (NF\_DN\_HELLO, NF\_DN\_ROUTE) are used in conjunction with DECnet Hello and Routing packets.

\subsubsection{Netfilter hooks within ARP}

Recent kernels\footnote{IIRC, starting with 2.4.19-pre3} have added support for netfilter hooks within the ARP code. There are two hooks: NF\_ARP\_IN and NF\_ARP\_OUT, for incoming and outgoing ARP packets respectively.

\subsubsection{Netfilter hooks within IPX}

There have been experimental patches to add netfilter hooks to the IPX code, but they never got integrated into the kernel source.

\subsection{Packet selection using IP Tables}

The IP tables core (ip\_tables.o) provides a generic layer for the evaluation of rulesets.
An IP table consists of an arbitrary number of {\it chains}, which in turn consist of a linear list of {\it rules}, which again consist of any number of {\it matches} and one {\it target}. {\it Chains} can further be divided into two classes: {\it builtin chains} and {\it user-defined chains}. Builtin chains are always present; they are created upon table registration. They are also the entry points for table iteration. User-defined chains are created at runtime upon user interaction.

{\it Matches} specify the matching criteria; there can be zero or more matches per rule. {\it Targets} specify the action which is to be executed in case {\bf all} matches match. There can only be a single target per rule. Matches and targets can either be {\it builtin} or implemented as {\it Linux kernel modules}.

There are two special targets:
\begin{itemize} \item By using a chain name as target, it is possible to jump to the respective chain in case the matches match. \item By using the RETURN target, it is possible to return to the previous (calling) chain. \end{itemize}

The IP tables core handles the following functions:
\begin{itemize} \item Registering and unregistering tables \item Registering and unregistering matches and targets (which can be implemented as Linux kernel modules) \item Kernel / userspace interface for manipulation of IP tables \item Traversal of IP tables \end{itemize}

\subsubsection{Packet filtering using the ``filter'' table}

Traditional packet filtering (i.e. the successor to ipfwadm/ipchains) takes place in the ``filter'' table. Packet filtering works like a sieve: a packet is (in the end) either dropped or accepted - but never modified.
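To make the traversal semantics concrete, here is a small user-space sketch in plain C. This is {\bf not} the actual ip\_tables.o code - all structure and function names are made up for illustration - but it captures the rule: all matches of a rule must match, and then the rule's single target decides the verdict; if no rule fires, the chain policy applies.

```c
/* Toy model of iptables chain traversal (NOT the real ip_tables.o
 * code; all names here are invented for illustration). */
#include <assert.h>
#include <stddef.h>

enum verdict { NF_DROP = 0, NF_ACCEPT = 1 };

struct pkt { int sport, dport; };

typedef int (*match_fn)(const struct pkt *p);

#define MAX_MATCHES 4

struct rule {
    match_fn matches[MAX_MATCHES];  /* zero or more matches ... */
    int nmatches;
    enum verdict target;            /* ... but exactly one target */
};

/* Walk a chain: return the target of the first rule whose matches
 * ALL match; fall back to the chain policy otherwise. */
enum verdict traverse_chain(const struct rule *rules, int nrules,
                            const struct pkt *p, enum verdict policy)
{
    for (int r = 0; r < nrules; r++) {
        int all_match = 1;
        for (int m = 0; m < rules[r].nmatches; m++) {
            if (!rules[r].matches[m](p)) {
                all_match = 0;
                break;
            }
        }
        if (all_match)
            return rules[r].target;
    }
    return policy;
}

/* two example matches */
int dport_is_25(const struct pkt *p) { return p->dport == 25; }
int sport_is_unprivileged(const struct pkt *p) { return p->sport >= 1024; }
```

A chain containing a single rule with both example matches and a DROP target would then drop a packet to port 25 from a high port, but let a packet to port 80 fall through to the chain policy.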
The ``filter'' table is implemented in the {\it iptable\_filter.o} module and contains three builtin chains:
\begin{itemize} \item {\bf INPUT} attaches to NF\_IP\_LOCAL\_IN \item {\bf FORWARD} attaches to NF\_IP\_FORWARD \item {\bf OUTPUT} attaches to NF\_IP\_LOCAL\_OUT \end{itemize}

The placement of the chains / hooks is done in such a way that every conceivable packet traverses exactly one of the builtin chains: packets destined for the local host traverse only INPUT, forwarded packets only FORWARD, and locally-originated packets only OUTPUT.

\subsubsection{Packet mangling using the ``mangle'' table}

As stated above, operations which would modify a packet do not belong in the ``filter'' table. The ``mangle'' table is available for all kinds of packet manipulation - but not manipulation of addresses (which is NAT). The mangle table attaches to all five netfilter hooks and provides the respective builtin chains (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING)\footnote{This has changed through the recent 2.4.x kernel series; old kernels may only support three chains (PREROUTING, POSTROUTING, OUTPUT).}.

\subsection{Connection Tracking Subsystem}

Traditional packet filters can only match on criteria within the currently processed packet, like source/destination IP address, port numbers, TCP flags, etc. As most applications have a notion of connections, or at least a request/response style protocol, there is a lot of information which cannot be derived from looking at a single packet. Thus, modern (stateful) packet filters attempt to track connections (flows) and their respective protocol states for all traffic through the packet filter.

Connection tracking within Linux is implemented as a netfilter module called ip\_conntrack.o. Before describing the connection tracking subsystem, we need to describe a couple of definitions and primitives used throughout the conntrack code.
A connection is represented within the conntrack subsystem using {\it struct ip\_conntrack}, also called the {\it connection tracking entry}. Connection tracking uses {\it conntrack tuples}, consisting of (srcip, srcport, dstip, dstport, l4prot). A connection is uniquely identified by two tuples: the tuple in the original direction (IP\_CT\_DIR\_ORIGINAL) and the tuple for the reply direction (IP\_CT\_DIR\_REPLY).

Connection tracking itself does not drop packets\footnote{well, in some rare cases in combination with NAT it needs to drop. But don't tell anyone, this is secret.} or impose any policy. It just associates every packet with a connection tracking entry, which in turn has a particular state. All other kernel code can use this state information\footnote{state information is internally represented via the {\it struct sk\_buff.nfct} structure member of a packet.}.

\subsubsection{Integration of conntrack with netfilter}

If the ip\_conntrack.o module is registered with netfilter, it attaches to the NF\_IP\_PRE\_ROUTING, NF\_IP\_POST\_ROUTING, NF\_IP\_LOCAL\_IN and NF\_IP\_LOCAL\_OUT hooks. Because forwarded packets are the most common case on firewalls, I will only describe how connection tracking works for forwarded packets. The two relevant hooks for forwarded packets are NF\_IP\_PRE\_ROUTING and NF\_IP\_POST\_ROUTING.

Every time a packet arrives at the NF\_IP\_PRE\_ROUTING hook, connection tracking creates a conntrack tuple from the packet. It then compares this tuple to the original and reply tuples of all already-seen connections\footnote{Of course this is not implemented as a linear search over all existing connections.} to find out if this just-arrived packet belongs to any existing connection. If there is no match, a new conntrack table entry (struct ip\_conntrack) is created.

Let's assume the case where we have no already existing connections and are starting from scratch.
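A toy model of these primitives may help. The field and function names below are simplified inventions; the real definitions live in {\it struct ip\_conntrack\_tuple} and friends. The key point is that the reply tuple is simply the original tuple with source and destination swapped, and that a packet belongs to a connection if its tuple equals either of the connection's two tuples.

```c
/* Simplified sketch of conntrack tuples (names abbreviated relative
 * to the real struct ip_conntrack_tuple). */
#include <assert.h>
#include <stdint.h>

struct tuple {
    uint32_t srcip, dstip;
    uint16_t srcport, dstport;
    uint8_t  l4prot;
};

enum { IP_CT_DIR_ORIGINAL = 0, IP_CT_DIR_REPLY = 1 };

struct conntrack {
    struct tuple dir[2];   /* one tuple per direction */
};

/* the reply tuple is the original tuple with src and dst swapped */
struct tuple invert_tuple(struct tuple t)
{
    struct tuple r = t;
    r.srcip = t.dstip;     r.dstip = t.srcip;
    r.srcport = t.dstport; r.dstport = t.srcport;
    return r;
}

/* a new connection tracking entry for a just-seen first packet */
struct conntrack new_conntrack(struct tuple orig)
{
    struct conntrack ct;
    ct.dir[IP_CT_DIR_ORIGINAL] = orig;
    ct.dir[IP_CT_DIR_REPLY] = invert_tuple(orig);
    return ct;
}

/* field-by-field tuple comparison */
int tuple_eq(struct tuple a, struct tuple b)
{
    return a.srcip == b.srcip && a.dstip == b.dstip &&
           a.srcport == b.srcport && a.dstport == b.dstport &&
           a.l4prot == b.l4prot;
}

/* a packet belongs to a connection if its tuple equals either
 * the original or the reply tuple */
int tuple_matches(const struct conntrack *ct, struct tuple t)
{
    return tuple_eq(ct->dir[IP_CT_DIR_ORIGINAL], t) ||
           tuple_eq(ct->dir[IP_CT_DIR_REPLY], t);
}
```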
The first packet comes in; we derive the tuple from the packet headers, look it up in the conntrack hash table, and don't find any matching entry. As a result, we create a new struct ip\_conntrack. This struct ip\_conntrack is filled with all necessary data, like the original and reply tuple of the connection. How do we know the reply tuple? By inverting the source and destination parts of the original tuple.\footnote{So why do we need two tuples, if they can be derived from each other? Wait until we discuss NAT.} Please note that this new struct ip\_conntrack is {\bf not} yet placed into the conntrack hash table.

The packet is now passed on to other callback functions which have registered with a lower priority at NF\_IP\_PRE\_ROUTING. It then continues traversal of the network stack as usual, including all respective netfilter hooks. If the packet survives (i.e. is not dropped by the routing code, network stack, firewall ruleset, ...), it re-appears at NF\_IP\_POST\_ROUTING. In this case, we can now safely assume that this packet will be sent off on the outgoing interface, and thus put the connection tracking entry which we created at NF\_IP\_PRE\_ROUTING into the conntrack hash table. This process is called {\it confirming the conntrack}.

The connection tracking code itself is not monolithic, but consists of a couple of separate modules\footnote{They don't actually have to be separate kernel modules; e.g. the TCP, UDP and ICMP tracking modules are all part of the Linux kernel module ip\_conntrack.o}. Besides the conntrack core, there are two important kinds of modules: protocol helpers and application helpers. Protocol helpers implement the layer-4-protocol specific parts. They currently exist for TCP, UDP and ICMP (an experimental helper for GRE exists).

\subsubsection{TCP connection tracking}

As TCP is a connection oriented protocol, it is not very difficult to imagine how connection tracking for this protocol could work.
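For instance, a drastically reduced model of TCP state tracking might look like the following. This is illustration only - the state names and transition function are invented, and the real state table in ip\_conntrack\_proto\_tcp.c is far larger and tracks both directions - but it shows the basic idea of accepting only transitions that are valid for the protocol:

```c
/* Drastically reduced model of TCP conntrack state tracking: just
 * the three-way handshake, asking only "is this transition valid?".
 * All names are invented for this sketch. */
#include <assert.h>

enum tcp_ct_state { CT_NONE, CT_SYN_SENT, CT_SYN_RECV,
                    CT_ESTABLISHED, CT_INVALID };

enum tcp_seg { SEG_SYN, SEG_SYNACK, SEG_ACK };

/* next tracked state, given the current state and the observed
 * segment type; anything unexpected is flagged invalid */
enum tcp_ct_state tcp_transition(enum tcp_ct_state s, enum tcp_seg seg)
{
    switch (s) {
    case CT_NONE:        return seg == SEG_SYN    ? CT_SYN_SENT    : CT_INVALID;
    case CT_SYN_SENT:    return seg == SEG_SYNACK ? CT_SYN_RECV    : CT_INVALID;
    case CT_SYN_RECV:    return seg == SEG_ACK    ? CT_ESTABLISHED : CT_INVALID;
    case CT_ESTABLISHED: return seg == SEG_ACK    ? CT_ESTABLISHED : CT_INVALID;
    default:             return CT_INVALID;
    }
}
```

A complete SYN, SYN/ACK, ACK handshake walks the model into the established state, while a stray ACK with no prior handshake is rejected as an invalid transition.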
There are well-defined state transitions possible, and conntrack can decide which state transitions are valid within the TCP specification. In reality it's not all that easy, since we cannot assume that all packets that pass the packet filter actually arrive at the receiving end, ...

It is noteworthy that the standard connection tracking code does {\bf not} do TCP sequence number and window tracking. A well-maintained patch to add this feature has existed for almost as long as connection tracking itself. It will be integrated with the 2.5.x kernel. The problem with window tracking is its bad interaction with connection pickup. The TCP conntrack code is able to pick up already existing connections, e.g. in case your firewall was rebooted. However, connection pickup conflicts with TCP window tracking: the TCP window scaling option is only transferred at connection setup time, and we don't know about it in case of pickup...

\subsubsection{ICMP tracking}

ICMP is not really a connection oriented protocol. So how is it possible to do connection tracking for ICMP? The ICMP protocol can be split into two groups of messages:
\begin{itemize} \item ICMP error messages, which sort-of belong to a different connection: they are associated as {\it RELATED} to that other connection (ICMP\_DEST\_UNREACH, ICMP\_SOURCE\_QUENCH, ICMP\_TIME\_EXCEEDED, ICMP\_PARAMETERPROB, ICMP\_REDIRECT). \item ICMP queries, which have a request/reply character. The conntrack code lets the request have a state of {\it NEW}, and the reply {\it ESTABLISHED}. The reply closes the connection immediately (ICMP\_ECHO, ICMP\_TIMESTAMP, ICMP\_INFO\_REQUEST, ICMP\_ADDRESS). \end{itemize}

\subsubsection{UDP connection tracking}

UDP is designed as a connectionless datagram protocol. But most common protocols using UDP as their layer 4 protocol have bi-directional UDP communication.
Imagine a DNS query, where the client sends a UDP packet to port 53 of the nameserver, and the nameserver sends back a DNS reply packet from its UDP port 53 to the client. Netfilter treats this as a connection. The first packet (the DNS request) is assigned a state of {\it NEW}, because the packet is expected to create a new 'connection'. The DNS server's reply packet is marked as {\it ESTABLISHED}.

\subsubsection{conntrack application helpers}

More complex application protocols involving multiple connections need special support by a so-called ``conntrack application helper module''. The stock kernel ships helper modules for FTP and IRC (DCC). Netfilter CVS currently contains patches for PPTP, H.323, Eggdrop botnet, tftp and talk. We're still lacking a lot of protocols (e.g. SIP, SMB/CIFS) - but they are unlikely to appear until somebody really needs them and either develops them on their own or funds development.

\subsubsection{Integration of connection tracking with iptables}

As stated earlier, conntrack doesn't impose any policy on packets. It just determines the relation of a packet to already existing connections. To base packet filtering decisions on this state information, the iptables {\it state} match can be used. Every packet falls into one of the following categories:
\begin{itemize} \item {\bf NEW}: the packet would create a new connection, if it survives \item {\bf ESTABLISHED}: the packet is part of an already established connection (either direction) \item {\bf RELATED}: the packet is in some way related to an already established connection, e.g. ICMP errors or FTP data sessions \item {\bf INVALID}: conntrack is unable to derive conntrack information from this packet. Please note that all multicast or broadcast packets fall into this category. \end{itemize}

\subsection{NAT Subsystem}

The NAT (Network Address Translation) subsystem is probably the worst documented subsystem within the whole framework. There are two reasons for this. First, NAT is nasty and complicated.
And second, the Linux 2.4.x NAT implementation is easy to use, so nobody needs to know the nasty details. Nonetheless, as I have traditionally concentrated most on the conntrack and NAT systems, I will give a short overview.

NAT uses almost all of the previously described subsystems:
\begin{itemize} \item IP tables to specify which packets to NAT in which particular way. NAT registers a ``nat'' table with PREROUTING, POSTROUTING and OUTPUT chains. \item Connection tracking to associate NAT state with the connection. \item Netfilter to do the actual packet manipulation, transparent to the rest of the kernel. NAT registers with NF\_IP\_PRE\_ROUTING, NF\_IP\_POST\_ROUTING, NF\_IP\_LOCAL\_IN and NF\_IP\_LOCAL\_OUT. \end{itemize}

The NAT implementation supports all kinds of different NAT: source NAT, destination NAT, NAT to address/port ranges, 1:1 NAT, ... This fundamental design principle is still frequently misunderstood:\\ The information about which NAT mappings apply to a certain connection is gathered only once - with the first packet of every connection.

So let's start to look at the life of a poor to-be-NATed packet. For ease of understanding, I have chosen to describe the most frequently used NAT scenario: source NAT of a forwarded packet. Let's assume the packet has an original source address of 1.1.1.1, an original destination address of 2.2.2.2, and is going to be SNATed to 9.9.9.9. Let's further ignore the fact that there are port numbers.

Once upon a time, our poor packet arrives at NF\_IP\_PRE\_ROUTING, where conntrack has registered with the highest priority. This means that a conntrack entry with the following two tuples is created:
\begin{verbatim}
IP_CT_DIR_ORIGINAL: 1.1.1.1 -> 2.2.2.2
IP_CT_DIR_REPLY:    2.2.2.2 -> 1.1.1.1
\end{verbatim}
After conntrack, the packet traverses the PREROUTING chain of the ``nat'' IP table. Since only destination NAT happens at PREROUTING, no action occurs.
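The ``gather once, apply to every packet'' principle can be modeled in a few lines of user-space C, using the example addresses from above. This is a sketch with made-up names, not the real NAT code: the real implementation records {\it struct ip\_nat\_manip} entries inside the conntrack entry, whereas here the altered reply tuple alone carries all the information we need.

```c
/* Toy model of SNAT bookkeeping (invented names, ports ignored as in
 * the text): 1.1.1.1 -> 2.2.2.2, SNATed to 9.9.9.9. */
#include <assert.h>
#include <stdint.h>

#define IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

struct tuple { uint32_t src, dst; };

struct conn {
    struct tuple orig;    /* IP_CT_DIR_ORIGINAL */
    struct tuple reply;   /* IP_CT_DIR_REPLY */
};

/* First packet hits the SNAT rule at POSTROUTING: alter the reply
 * tuple so that replies are addressed to the mapped source. */
void setup_snat(struct conn *ct, uint32_t mapped_src)
{
    ct->reply.dst = mapped_src;
}

/* Every packet of the connection, in either direction, just applies
 * the mapping already recorded in the conntrack entry. */
void apply_nat(const struct conn *ct, struct tuple *pkt)
{
    if (pkt->src == ct->orig.src && pkt->dst == ct->orig.dst)
        pkt->src = ct->reply.dst;  /* original dir: rewrite source */
    else if (pkt->src == ct->reply.src && pkt->dst == ct->reply.dst)
        pkt->dst = ct->orig.src;   /* reply dir: rewrite dst back */
}
```

Note that the inverse transformation for reply packets never has to be configured explicitly: it falls out of the altered reply tuple, which is exactly why a connection needs two stored tuples rather than one.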
After its lengthy way through the rest of the network stack, the packet arrives at the NF\_IP\_POST\_ROUTING hook, where it traverses the POSTROUTING chain of the ``nat'' table. Here it hits a SNAT rule, causing the following actions:
\begin{itemize} \item Fill in a {\it struct ip\_nat\_manip}, indicating the new source address and the type of NAT (source NAT at POSTROUTING). This struct is part of the conntrack entry. \item Automatically derive the inverse NAT transformation for the reply packets: destination NAT at PREROUTING. Fill in another {\it struct ip\_nat\_manip}. \item Alter the REPLY tuple of the conntrack entry to
\begin{verbatim}
IP_CT_DIR_REPLY: 2.2.2.2 -> 9.9.9.9
\end{verbatim}
\item Apply the SNAT transformation to the packet \end{itemize}

Every other packet within this connection, independent of its direction, will only execute the last step. Since all NAT information is connected with the conntrack entry, there is no need to do anything but apply the same transformations to all packets within the same connection.

\subsection{IPv6 Firewalling with ip6tables}

Yes, Linux 2.4.x comes with a usable, though incomplete, system to secure your IPv6 network. The parts ported to IPv6 are:
\begin{itemize} \item IP tables (called IP6 tables) \item The ``filter'' table \item The ``mangle'' table \item The userspace library (libip6tc) \item The command line tool (ip6tables) \end{itemize}

Due to the lack of conntrack and NAT\footnote{for god's sake, we don't have NAT with IPv6}, only traditional, stateless packet filtering is possible.
Apart from the obvious matches/targets, ip6tables can match on:
\begin{itemize} \item {\it EUI64 checker}: verifies that the MAC address of the sender corresponds to the EUI-64 part (the 64 least significant bits) of the source IPv6 address \item {\it frag6 match}: matches on the IPv6 fragmentation header \item {\it route6 match}: matches on the IPv6 routing header \item {\it ahesp6 match}: matches on SPIs within AH or ESP over IPv6 packets \end{itemize}

However, the ip6tables code doesn't seem to be used very widely (yet?). So please expect some remaining issues, since it has not been tested as heavily as iptables.

\subsection{Recent Development}

Please refer to the spoken word at the presentation. Development at the time this paper was written can be quite different from development at the time the presentation is held.

\section{Thanks}

I'd like to thank
\begin{itemize} \item {\it Linus Torvalds} for starting this interesting UNIX-like kernel \item {\it Alan Cox, David Miller, Alexey Kuznetsov, Andi Kleen} for building (one of?) the world's best TCP/IP stacks \item {\it Paul ``Rusty'' Russell} for starting the netfilter/iptables project \item {\it The Netfilter Core Team} for continuing the netfilter/iptables effort \item {\it Astaro AG} for partially funding my current netfilter/iptables work \item {\it Conectiva Inc.} for partially funding parts of my past netfilter/iptables work and for inviting me to live in Brazil \item {\it samba.org and Kommunikationsnetz Franken e.V.} for hosting the netfilter homepage, CVS, mailing lists, ... \end{itemize}

\end{document}