Diffstat (limited to '2002/netfilter-internals-lt2002/netfilter-internals-lt2002.tex')
-rw-r--r--  2002/netfilter-internals-lt2002/netfilter-internals-lt2002.tex  537
 1 files changed, 537 insertions, 0 deletions
diff --git a/2002/netfilter-internals-lt2002/netfilter-internals-lt2002.tex b/2002/netfilter-internals-lt2002/netfilter-internals-lt2002.tex
new file mode 100644
index 0000000..c3a28ea
--- /dev/null
+++ b/2002/netfilter-internals-lt2002/netfilter-internals-lt2002.tex
@@ -0,0 +1,537 @@
+\documentclass{article}
+\usepackage{german}
+\usepackage{fancyheadings}
+\usepackage{a4}
+
+\setlength{\oddsidemargin}{0in}
+\setlength{\evensidemargin}{0in}
+\setlength{\topmargin}{0.0in}
+\setlength{\headheight}{0in}
+\setlength{\headsep}{0in}
+\setlength{\textwidth}{6.5in}
+\setlength{\textheight}{9.5in}
+\setlength{\parindent}{0in}
+\setlength{\parskip}{0.05in}
+
+
+\begin{document}
+\title{Linux 2.4.x netfilter/iptables firewalling internals}
+
+\author{Harald Welte\\
+ laforge@gnumonks.org\\
+ \copyright{}2002 H. Welte}
+
+\date{April 25, 2002}
+
+\maketitle
+
+\setcounter{section}{0}
+\setcounter{subsection}{0}
+\setcounter{subsubsection}{0}
+
+\section{Introduction}
+The Linux 2.4.x kernel series has introduced a totally new kernel firewalling
+subsystem. It is much more than a plain successor of ipfwadm or ipchains.
+
+The netfilter/iptables project has a very modular design and its
+sub-projects can be split into several parts: netfilter, iptables, connection
+tracking, NAT and packet mangling.
+
+While most users will already have learned how to use the basic functions
+of netfilter/iptables in order to convert their old ipchains firewalls to
+iptables, there's more advanced but less used functionality in
+netfilter/iptables.
+
+The presentation covers the design principles behind the netfilter/iptables
+implementation. This knowledge enables us to understand how the individual
+parts of netfilter/iptables fit together and for which potential applications
+they are useful.
+
+\section{Internal netfilter/iptables architecture}
+
+\subsection{Netfilter hooks in protocol stacks}
+
+One of the major motivations behind the redesign of the Linux packet
+filtering and NAT system during the 2.3.x kernel series was the widespread
+firewall-specific code within the core IPv4 stack. Ideally the core
+IPv4 stack (as used by regular hosts and routers) should not contain any
+firewalling-specific code, resulting in no unwanted interaction and less
+code complexity. This desire led to the invention of {\it netfilter}.
+
+\subsubsection{Architecture of netfilter}
+
+Netfilter is basically a system of callback functions within the network
+stack. It provides a non-portable API towards in-kernel networking
+extensions.
+
+What we call a {\it netfilter hook} is a well-defined call-out point within a
+layer three protocol stack, such as IPv4, IPv6 or DECnet. Any layer three
+network stack can define an arbitrary number of hooks, usually placed at
+strategic points within the packet flow.
+
+Any other kernel code can now subsequently register callback functions for
+any of these hooks. As in most systems there will be more than one callback
+function registered for a particular hook, a {\it priority} is specified upon
+registration of the callback function. This priority defines the order in
+which the individual callback functions at a particular hook are called.
+
+The return value of a registered callback function can be one of:
+\begin{itemize}
+\item
+{\bf NF\_ACCEPT}: continue traversal as usual
+\item
+{\bf NF\_DROP}: drop the packet; do not continue traversal
+\item
+{\bf NF\_STOLEN}: callback function has taken over the packet; do not continue
+\item
+{\bf NF\_QUEUE}: enqueue the packet to userspace
+\item
+{\bf NF\_REPEAT}: call this hook again
+\end{itemize}
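+
+As an illustration, a minimal sketch of registering such a callback with
+the 2.4.x API (the function name, hook and priority chosen here are
+arbitrary examples):
+
+\begin{verbatim}
+/* needs <linux/netfilter.h>, <linux/netfilter_ipv4.h>, <linux/init.h> */
+static unsigned int my_hook(unsigned int hooknum,
+                            struct sk_buff **pskb,
+                            const struct net_device *in,
+                            const struct net_device *out,
+                            int (*okfn)(struct sk_buff *))
+{
+        return NF_ACCEPT;          /* let the packet continue */
+}
+
+static struct nf_hook_ops my_ops = {
+        { NULL, NULL },            /* list; filled in by netfilter */
+        my_hook,                   /* the callback function */
+        PF_INET,                   /* protocol family: IPv4 */
+        NF_IP_PRE_ROUTING,         /* which hook to attach to */
+        NF_IP_PRI_FIRST            /* priority: run before all others */
+};
+
+static int __init my_module_init(void)
+{
+        return nf_register_hook(&my_ops);
+}
+\end{verbatim}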
+
+\subsubsection{Netfilter hooks within IPv4}
+
+The IPv4 stack provides five netfilter hooks, which are placed at the
+following strategic points within the packet flow:
+
+\begin{verbatim}
+ --->[1]--->[ROUTE]--->[3]--->[4]--->
+ | ^
+ | |
+ | [ROUTE]
+ v |
+ [2] [5]
+ | ^
+ | |
+ v |
+
+ local processes
+\end{verbatim}
+
+Packets received on any network interface arrive at the left side of the
+diagram. After the verification of the IP header checksum, the
+NF\_IP\_PRE\_ROUTING [1] hook is traversed.
+
+If the packet ``survives'' (i.e. NF\_ACCEPT is returned by all callbacks), it
+enters the routing code. Where we continue from here depends on the
+destination of the packet.
+
+Packets with a local destination (i.e. packets where the destination address is
+one of the own IP addresses of the host) traverse the NF\_IP\_LOCAL\_IN [2]
+hook. If all callback functions return NF\_ACCEPT, the packet is finally passed
+to the socket code, which eventually passes the packet to a local process.
+
+Packets with a remote destination (i.e. packets which are forwarded by the
+local machine) traverse the NF\_IP\_FORWARD [3] hook. If they ``survive'',
+they finally pass the NF\_IP\_POST\_ROUTING [4] hook and are sent off the
+outgoing network interface.
+
+Locally generated packets first traverse the NF\_IP\_LOCAL\_OUT [5] hook, then
+enter the routing code, and finally go through the NF\_IP\_POST\_ROUTING [4]
+hook before being sent off the outgoing network interface.
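+
+Inside the stack, each of these call-out points is a use of the NF\_HOOK
+macro. As an illustration (simplified from the 2.4.x IPv4 receive path),
+hook [1] is traversed roughly like this:
+
+\begin{verbatim}
+/* at the end of ip_rcv(): run all NF_IP_PRE_ROUTING callbacks;
+ * if they all return NF_ACCEPT, continue with ip_rcv_finish() */
+return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
+               ip_rcv_finish);
+\end{verbatim}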
+
+\subsubsection{Netfilter hooks within IPv6}
+
+As the IPv4 and IPv6 protocols are very similar, the netfilter hooks within the
+IPv6 stack are placed at exactly the same locations as in the IPv4 stack. The
+only difference is the hook names: NF\_IP6\_PRE\_ROUTING, NF\_IP6\_LOCAL\_IN,
+NF\_IP6\_FORWARD, NF\_IP6\_POST\_ROUTING, NF\_IP6\_LOCAL\_OUT.
+
+\subsubsection{Netfilter hooks within DECnet}
+
+There are seven DECnet hooks. The first five hooks (NF\_DN\_PRE\_ROUTING,
+NF\_DN\_LOCAL\_IN, NF\_DN\_FORWARD, NF\_DN\_LOCAL\_OUT, NF\_DN\_POST\_ROUTING)
+are pretty much the same as in IPv4. The last two hooks (NF\_DN\_HELLO,
+NF\_DN\_ROUTE) are used in conjunction with DECnet Hello and Routing packets.
+
+\subsubsection{Netfilter hooks within ARP}
+
+Recent kernels\footnote{IIRC, starting with 2.4.19-pre3} have added support for netfilter hooks within the ARP code.
+There are two hooks: NF\_ARP\_IN and NF\_ARP\_OUT, for incoming and outgoing
+ARP packets respectively.
+
+\subsubsection{Netfilter hooks within IPX}
+
+There have been experimental patches to add netfilter hooks to the IPX code,
+but they never got integrated into the kernel source.
+
+\subsection{Packet selection using IP Tables}
+
+The IP tables core (ip\_tables.o) provides a generic layer for the evaluation
+of rulesets.
+
+An IP table consists of an arbitrary number of {\it chains}, which in turn
+consist of a linear list of {\it rules}, which again consist of any
+number of {\it matches} and one {\it target}.
+
+{\it Chains} can further be divided into two classes: {\it builtin
+chains} and {\it user-defined chains}. Builtin chains are always present; they
+are created upon table registration and are also the entry points for table
+traversal. User-defined chains are created at runtime upon user interaction.
+
+{\it Matches} specify the matching criteria; there can be zero or more matches per rule.
+
+{\it Targets} specify the action which is to be executed in case {\bf all}
+matches match. There can only be a single target per rule.
+
+Matches and targets can either be {\it builtin} or {\it Linux kernel modules}.
+
+There are two special targets:
+\begin{itemize}
+\item
+By using a chain name as a target, it is possible to jump to the respective
+chain in case the matches match.
+\item
+By using the RETURN target, it is possible to return to the previous (calling)
+chain.
+\end{itemize}
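+
+A hypothetical ruleset fragment illustrating both special targets (the
+chain name ``tcpchecks'' is an arbitrary example):
+
+\begin{verbatim}
+# create a user-defined chain and jump to it from INPUT
+iptables -N tcpchecks
+iptables -A INPUT -p tcp -j tcpchecks
+
+# inside the user-defined chain: accept new SSH connections,
+# return to the calling chain for everything else
+iptables -A tcpchecks -p tcp --dport 22 -j ACCEPT
+iptables -A tcpchecks -j RETURN
+\end{verbatim}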
+
+The IP tables core handles the following functions:
+\begin{itemize}
+\item
+Registering and unregistering tables
+\item
+Registering and unregistering matches and targets (which can be implemented as Linux kernel modules)
+\item
+Kernel / userspace interface for manipulation of IP tables
+\item
+Traversal of IP tables
+\end{itemize}
+
+\subsubsection{Packet filtering using the ``filter'' table}
+
+Traditional packet filtering (i.e. the successor to ipfwadm/ipchains) takes
+place in the ``filter'' table. Packet filtering works like a sieve: A packet
+is (in the end) either dropped or accepted - but never modified.
+
+The ``filter'' table is implemented in the {\it iptable\_filter.o} module
+and contains three builtin chains:
+
+\begin{itemize}
+\item
+{\bf INPUT} attaches to NF\_IP\_LOCAL\_IN
+\item
+{\bf FORWARD} attaches to NF\_IP\_FORWARD
+\item
+{\bf OUTPUT} attaches to NF\_IP\_LOCAL\_OUT
+\end{itemize}
+
+The placement of the chains / hooks is done in such a way that every
+conceivable packet always traverses exactly one of the builtin chains: packets
+destined for the local host traverse only INPUT, forwarded packets only
+FORWARD, and locally-originated packets only OUTPUT.
+
+\subsubsection{Packet mangling using the ``mangle'' table}
+
+As stated above, operations which would modify a packet do not belong in the
+``filter'' table. The ``mangle'' table is available for all kinds of packet
+manipulation - but not manipulation of addresses (which is NAT).
+
+The mangle table attaches to all five netfilter hooks and provides the
+respective builtin chains (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING)
+\footnote{This has changed through recent 2.4.x kernel series, old kernels may
+only support three (PREROUTING, POSTROUTING, OUTPUT) chains.}.
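+
+As a sketch of typical mangle usage (an illustrative example, not tied
+to a specific kernel version), one could set the TOS bits of interactive
+traffic:
+
+\begin{verbatim}
+# mark ssh traffic as low-delay in the mangle table
+iptables -t mangle -A PREROUTING -p tcp --dport 22 \
+         -j TOS --set-tos Minimize-Delay
+\end{verbatim}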
+
+\subsection{Connection Tracking Subsystem}
+
+Traditional packet filters can only match on criteria within the
+currently processed packet, like source/destination IP address, port numbers,
+TCP flags, etc. As most applications have a notion of connections, or at least
+a request/response style protocol, there is a lot of information which cannot
+be derived from looking at a single packet.
+
+Thus, modern (stateful) packet filters attempt to track connections (flows)
+and their respective protocol states for all traffic through the packet
+filter.
+
+Connection tracking within Linux is implemented as a netfilter module, called
+ip\_conntrack.o.
+
+Before describing the connection tracking subsystem itself, we need to introduce a couple of definitions and primitives used throughout the conntrack code.
+
+A connection is represented within the conntrack subsystem using {\it struct
+ip\_conntrack}, also called {\it connection tracking entry}.
+
+Connection tracking makes use of {\it conntrack tuples}, which are tuples
+consisting of (srcip, srcport, dstip, dstport, l4proto). A connection is
+uniquely identified by two tuples: The tuple in the original direction
+(IP\_CT\_DIR\_ORIGINAL) and the tuple for the reply direction
+(IP\_CT\_DIR\_REPLY).
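+
+As a simplified sketch (deliberately not the exact kernel definition,
+which spreads these fields over nested structs and unions), the data
+involved looks roughly like this:
+
+\begin{verbatim}
+/* simplified illustration of a conntrack tuple */
+struct conntrack_tuple {
+        u_int32_t src_ip, dst_ip;
+        u_int16_t src_port, dst_port;   /* or ICMP id, etc. */
+        u_int8_t  protonum;             /* layer 4 protocol */
+};
+
+/* a connection is identified by one tuple per direction */
+struct connection {
+        struct conntrack_tuple tuple[2];  /* ORIGINAL, REPLY */
+};
+\end{verbatim}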
+
+Connection tracking itself does not drop packets\footnote{well, in some rare
+cases in combination with NAT it needs to drop. But don't tell anyone, this is
+secret.} or impose any policy. It just associates every packet with a
+connection tracking entry, which in turn has a particular state. All other
+kernel code can use this state information\footnote{The state information is
+internally attached to the packet via the {\it nfct} member of {\it struct
+sk\_buff}.}.
+
+\subsubsection{Integration of conntrack with netfilter}
+
+If the ip\_conntrack.o module is registered with netfilter, it attaches to the
+NF\_IP\_PRE\_ROUTING, NF\_IP\_POST\_ROUTING, NF\_IP\_LOCAL\_IN and
+NF\_IP\_LOCAL\_OUT hooks.
+
+Because forwarded packets are the most common case on firewalls, I will only
+describe how connection tracking works for forwarded packets. The two relevant
+hooks for forwarded packets are NF\_IP\_PRE\_ROUTING and NF\_IP\_POST\_ROUTING.
+
+Every time a packet arrives at the NF\_IP\_PRE\_ROUTING hook, connection
+tracking creates a conntrack tuple from the packet. It then compares this
+tuple to the original and reply tuples of all already-seen connections
+\footnote{Of course this is not implemented as a linear search over all existing connections.} to find out if this just-arrived packet belongs to any existing
+connection. If there is no match, a new conntrack table entry (struct
+ip\_conntrack) is created.
+
+Let's assume the case where we have no already-existing connections and are
+starting from scratch.
+
+The first packet comes in, we derive the tuple from the packet headers, look up
+the conntrack hash table, don't find any matching entry. As a result, we
+create a new struct ip\_conntrack. This struct ip\_conntrack is filled with
+all necessary data, like the original and reply tuple of the connection.
+How do we know the reply tuple? By inverting the source and destination
+parts of the original tuple.\footnote{So why do we need two tuples, if they can
+be derived from each other? Wait until we discuss NAT.}
+Please note that this new struct ip\_conntrack is {\bf not} yet placed
+into the conntrack hash table.
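+
+In terms of the simplified tuple sketch above, deriving the reply tuple
+is just an inversion (the real kernel code does this per protocol
+helper, but the idea is the same):
+
+\begin{verbatim}
+/* sketch: build the reply tuple by swapping source and destination */
+static void invert_tuple(struct conntrack_tuple *reply,
+                         const struct conntrack_tuple *orig)
+{
+        reply->src_ip   = orig->dst_ip;
+        reply->dst_ip   = orig->src_ip;
+        reply->src_port = orig->dst_port;
+        reply->dst_port = orig->src_port;
+        reply->protonum = orig->protonum;
+}
+\end{verbatim}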
+
+The packet is now passed on to other callback functions which have registered
+with a lower priority at NF\_IP\_PRE\_ROUTING. It then continues traversal of
+the network stack as usual, including all respective netfilter hooks.
+
+If the packet survives (i.e. is not dropped by the routing code, network stack,
+firewall ruleset, ...), it re-appears at NF\_IP\_POST\_ROUTING. In this case,
+we can now safely assume that this packet will be sent off on the outgoing
+interface, and thus put the connection tracking entry which we created at
+NF\_IP\_PRE\_ROUTING into the conntrack hash table. This process is called
+{\it confirming the conntrack}.
+
+The connection tracking code itself is not monolithic, but consists of a
+couple of separate modules\footnote{They don't actually have to be separate
+kernel modules; e.g. the TCP, UDP and ICMP tracking modules are all part of
+the Linux kernel module ip\_conntrack.o}. Besides the conntrack core, there
+are two important kinds of modules: protocol helpers and application helpers.
+
+Protocol helpers implement the layer-4-protocol specific parts. They currently
+exist for TCP, UDP and ICMP (an experimental helper for GRE exists).
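+
+Conceptually, a protocol helper is a set of callbacks which the conntrack
+core invokes; the following sketch is illustrative only (the real 2.4.x
+structure is {\it struct ip\_conntrack\_protocol}, with different field
+names and signatures):
+
+\begin{verbatim}
+struct sk_buff;           /* kernel packet buffer */
+struct conntrack_tuple;   /* see the simplified sketch above */
+struct connection;
+
+struct l4_protocol_helper {
+        unsigned int proto;        /* e.g. IPPROTO_TCP */
+        const char *name;
+        /* extract ports etc. from a packet into a tuple */
+        int (*pkt_to_tuple)(const struct sk_buff *skb,
+                            struct conntrack_tuple *tuple);
+        /* update connection state; returns a verdict */
+        int (*packet)(struct connection *ct,
+                      const struct sk_buff *skb);
+        /* called for the first packet of a new connection */
+        int (*new)(struct connection *ct,
+                   const struct sk_buff *skb);
+};
+\end{verbatim}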
+
+\subsubsection{TCP connection tracking}
+
+As TCP is a connection-oriented protocol, it is not very difficult to imagine
+how connection tracking for this protocol could work: there are well-defined
+state transitions, and conntrack can decide which transitions are valid
+within the TCP specification. In reality it's not all that easy,
+since we cannot assume that all packets that pass the packet filter actually
+arrive at the receiving end, ...
+
+It is noteworthy that the standard connection tracking code does {\bf not}
+do TCP sequence number and window tracking. A well-maintained patch to add
+this feature has existed almost as long as connection tracking itself. It will
+be integrated with the 2.5.x kernel. The problem with window tracking is
+its bad interaction with connection pickup: the TCP conntrack code is able to
+pick up already-existing connections, e.g. in case your firewall was rebooted.
+However, connection pickup conflicts with TCP window tracking: the TCP
+window scaling option is only transferred at connection setup time, and we
+don't know about it in case of pickup...
+
+\subsubsection{ICMP tracking}
+
+ICMP is not really a connection oriented protocol. So how is it possible to
+do connection tracking for ICMP?
+
+The ICMP protocol can be split into two groups of messages:
+
+\begin{itemize}
+\item
+ICMP error messages, which sort-of belong to a different connection: they
+are associated as {\it RELATED} to that connection
+(ICMP\_DEST\_UNREACH, ICMP\_SOURCE\_QUENCH, ICMP\_TIME\_EXCEEDED,
+ICMP\_PARAMETERPROB, ICMP\_REDIRECT).
+\item
+ICMP queries, which have a request/reply character. The conntrack
+code lets the request have a state of {\it NEW}, and the reply
+{\it ESTABLISHED}. The reply closes the connection immediately.
+(ICMP\_ECHO, ICMP\_TIMESTAMP, ICMP\_INFO\_REQUEST, ICMP\_ADDRESS)
+\end{itemize}
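+
+Combined with the {\it state} match described below, this makes rules
+like the following possible (an illustrative sketch):
+
+\begin{verbatim}
+# outgoing ping: the echo request is NEW, the echo reply comes
+# back as ESTABLISHED and the entry is closed right afterwards
+iptables -A OUTPUT -p icmp --icmp-type echo-request \
+         -m state --state NEW,ESTABLISHED -j ACCEPT
+iptables -A INPUT -p icmp --icmp-type echo-reply \
+         -m state --state ESTABLISHED -j ACCEPT
+\end{verbatim}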
+
+\subsubsection{UDP connection tracking}
+
+UDP is designed as a connectionless datagram protocol. But most common
+protocols using UDP as their layer 4 protocol have bi-directional UDP
+communication. Imagine a DNS query, where the client sends a UDP frame to
+port 53 of the nameserver, and the nameserver sends back a DNS reply packet
+from its UDP port 53 to the client.
+
+Netfilter treats this as a connection. The first packet (the DNS request) is
+assigned a state of {\it NEW}, because the packet is expected to create a new
+'connection'. The DNS server's reply packet is marked as {\it ESTABLISHED}.
+
+\subsubsection{conntrack application helpers}
+
+More complex application protocols involving multiple connections need special
+support by a so-called ``conntrack application helper module''. Modules in
+the stock kernel exist for FTP and IRC (DCC). Netfilter CVS currently contains
+patches for PPTP, H.323, Eggdrop botnet, tftp and talk. We're still lacking
+a lot of protocols (e.g. SIP, SMB/CIFS) - but they are unlikely to appear
+until somebody really needs them and either develops them on his own or
+funds development.
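+
+Application helpers are loaded as ordinary kernel modules; for example,
+the FTP helper accepts a list of ports to monitor (the second line shows
+a hypothetical non-standard port):
+
+\begin{verbatim}
+modprobe ip_conntrack_ftp                 # track FTP on port 21
+modprobe ip_conntrack_ftp ports=21,2121   # non-standard port, too
+\end{verbatim}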
+
+\subsubsection{Integration of connection tracking with iptables}
+
+As stated earlier, conntrack doesn't impose any policy on packets. It just
+determines the relation of a packet to already existing connections. To base
+packet filtering decisions on this state information, the iptables {\it state}
+match can be used. Every packet is in one of the following categories:
+
+\begin{itemize}
+\item
+{\bf NEW}: packet would create a new connection, if it survives
+\item
+{\bf ESTABLISHED}: packet is part of an already established connection
+(either direction)
+\item
+{\bf RELATED}: packet is in some way related to an already established connection, e.g. ICMP errors or FTP data sessions
+\item
+{\bf INVALID}: conntrack is unable to derive conntrack information from this packet. Please note that all multicast or broadcast packets fall in this category.
+\end{itemize}
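+
+A minimal stateful ruleset sketch using this match (the interface name
+is a hypothetical example):
+
+\begin{verbatim}
+# let replies and related traffic through, allow new connections
+# only from the internal interface, drop everything else
+iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
+iptables -A FORWARD -m state --state NEW -i eth1 -j ACCEPT
+iptables -A FORWARD -j DROP
+\end{verbatim}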
+
+\subsection{NAT Subsystem}
+
+The NAT (Network Address Translation) subsystem is probably the worst
+documented subsystem within the whole framework. There are two reasons for
+this: NAT is nasty and complicated, and the Linux 2.4.x NAT implementation is
+so easy to use that nobody needs to know the nasty details.
+
+Nonetheless, as I have traditionally concentrated mostly on the conntrack and
+NAT subsystems, I will give a short overview.
+
+NAT uses almost all of the previously described subsystems:
+\begin{itemize}
+\item
+IP tables to specify which packets to NAT in which particular way. NAT
+registers a ``nat'' table with PREROUTING, POSTROUTING and OUTPUT chains.
+\item
+Connection tracking to associate NAT state with the connection.
+\item
+Netfilter to do the actual packet manipulation transparently to the rest of the
+kernel. NAT registers with NF\_IP\_PRE\_ROUTING, NF\_IP\_POST\_ROUTING,
+NF\_IP\_LOCAL\_IN and NF\_IP\_LOCAL\_OUT.
+\end{itemize}
+
+The NAT implementation supports all kinds of different NAT: Source NAT,
+Destination NAT, NAT to address/port ranges, 1:1 NAT, ...
+
+This fundamental design principle is still frequently misunderstood:\\
+The information about which NAT mappings apply to a certain connection
+is only gathered once - with the first packet of every connection.
+
+So let's start to look at the life of a poor to-be-nat'ed packet.
+For ease of understanding, I have chosen to describe the most frequently
+used NAT scenario: Source NAT of a forwarded packet. Let's assume the
+packet has an original source address of 1.1.1.1, an original destination
+address of 2.2.2.2, and is going to be SNAT'ed to 9.9.9.9. Let's further
+ignore the fact that there are port numbers.
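+
+The rule triggering this scenario could look as follows (the outgoing
+interface name is a hypothetical example):
+
+\begin{verbatim}
+iptables -t nat -A POSTROUTING -s 1.1.1.1 -o eth0 \
+         -j SNAT --to-source 9.9.9.9
+\end{verbatim}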
+
+Once upon a time, our poor packet arrives at NF\_IP\_PRE\_ROUTING, where
+conntrack has registered with the highest priority. This means that a conntrack
+entry with the following two tuples is created:
+\begin{verbatim}
+IP_CT_DIR_ORIGINAL: 1.1.1.1 -> 2.2.2.2
+IP_CT_DIR_REPLY: 2.2.2.2 -> 1.1.1.1
+\end{verbatim}
+After conntrack, the packet traverses the PREROUTING chain of the ``nat''
+IP table. Since only destination NAT happens at PREROUTING, no action
+occurs. After its lengthy way through the rest of the network stack,
+the packet arrives at the NF\_IP\_POST\_ROUTING hook, where it traverses
+the POSTROUTING chain of the ``nat'' table. Here it hits a SNAT rule,
+causing the following actions:
+\begin{itemize}
+\item
+Fill in a {\it struct ip\_nat\_manip}, indicating the new source address
+and the type of NAT (source NAT at POSTROUTING). This struct is part of the
+conntrack entry.
+\item
+Automatically derive the inverse NAT transformation for the reply packets:
+Destination NAT at PREROUTING. Fill in another {\it struct ip\_nat\_manip}.
+\item
+Alter the REPLY tuple of the conntrack entry to
+\begin{verbatim}
+IP_CT_DIR_REPLY: 2.2.2.2 -> 9.9.9.9
+\end{verbatim}
+\item
+Apply the SNAT transformation to the packet
+\end{itemize}
+
+Every other packet within this connection, independent of its direction,
+will only execute the last step. Since all NAT information is connected
+with the conntrack entry, there is no need to do anything but apply
+the same transformations to all packets within the same connection.
+
+\subsection{IPv6 Firewalling with ip6tables}
+
+Yes, Linux 2.4.x comes with a usable, though incomplete system to secure
+your IPv6 network.
+
+The parts ported to IPv6 are:
+\begin{itemize}
+\item
+IP tables (called IP6 tables)
+\item
+The ``filter'' table
+\item
+The ``mangle'' table
+\item
+The userspace library (libip6tc)
+\item
+The command line tool (ip6tables)
+\end{itemize}
+
+Due to the lack of conntrack and NAT\footnote{for god's sake we don't have NAT
+with IPv6}, only traditional, stateless packet filtering is possible. Apart
+from the obvious matches/targets, ip6tables can match on:
+\begin{itemize}
+\item
+{\it EUI64 checker}; verifies that the sender's MAC address matches the EUI-64
+derived 64 least significant bits of the source IPv6 address
+\item
+{\it frag6 match}; matches on the IPv6 fragmentation header
+\item
+{\it route6 match}; matches on the IPv6 routing header
+\item
+{\it ahesp6 match}; matches on SPIs within AH or ESP over IPv6 packets
+\end{itemize}
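+
+A stateless IPv6 ruleset thus has to allow return traffic explicitly;
+a minimal sketch:
+
+\begin{verbatim}
+ip6tables -P INPUT DROP
+ip6tables -A INPUT -p icmpv6 -j ACCEPT            # neighbour discovery etc.
+ip6tables -A INPUT -p tcp --dport 22 -j ACCEPT    # incoming ssh
+ip6tables -A INPUT -p tcp ! --syn -j ACCEPT       # poor man's statefulness
+\end{verbatim}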
+
+However, the ip6tables code doesn't seem to be used very widely (yet?).
+So please expect some potential remaining issues, since it is not tested
+as heavily as iptables.
+
+\subsection{Recent Development}
+
+Please refer to the spoken word at the presentation. The state of development
+at the time this paper was written may be quite different from that at the
+time the presentation is held.
+
+\section{Thanks}
+
+I'd like to thank
+\begin{itemize}
+\item
+{\it Linus Torvalds} for starting this interesting UNIX-like kernel
+\item
+{\it Alan Cox, David Miller, Alexey Kuznetsov, Andi Kleen} for building
+(one of?) the world's best TCP/IP stacks.
+\item
+{\it Paul ``Rusty'' Russell} for starting the netfilter/iptables project
+\item
+{\it The Netfilter Core Team} for continuing the netfilter/iptables effort
+\item
+{\it Astaro AG} for partially funding my current netfilter/iptables work
+\item
+{\it Conectiva Inc.} for partially funding my past netfilter/iptables
+work and for inviting me to live in Brazil
+\item
+{\it samba.org and Kommunikationsnetz Franken e.V.} for hosting the netfilter
+homepage, CVS, mailing lists, ...
+\end{itemize}
+
+\end{document}