\documentclass[twocolumn,12pt]{article} \usepackage{alltt} \usepackage[T1]{fontenc} \usepackage[latin1]{inputenc} \usepackage{isolatin1} \usepackage{latexsym} \usepackage{textcomp} \usepackage{times} \usepackage{url} \usepackage[T1,obeyspaces]{zrl} % "verbatim" with line breaks, obeying spaces \providecommand\code{\begingroup \xrlstyle{tt}\Xrl} % as above, but okay to break lines at spaces \providecommand\brcode{\begingroup \zrlstyle{tt}\Zrl} % Same as the pair above, but 'l' for long == small type \providecommand\lcode{\begingroup \small\xrlstyle{tt}\Xrl} \providecommand\lbrcode{\begingroup \small\zrlstyle{tt}\Zrl} % For identifiers - "verbatim" with line breaks at punctuation \providecommand\ident{\begingroup \urlstyle{tt}\Url} \providecommand\lident{\begingroup \small\urlstyle{tt}\Url} \begin{document} % Required: do not print the date. \date{} \title{\texttt{ct\_sync}: state replication of \texttt{ip\_conntrack}\\ % {\normalsize Subtitle goes here} } \author{ Harald Welte \\ {\em netfilter core team / Astaro AG / hmw-consulting.de}\\ {\tt\normalsize laforge@gnumonks.org}\\ % \and % Second Author\\ % {\em Second Institution}\\ % {\tt\normalsize another@address.for.email.com}\\ } % end author section \maketitle % Required: do not use page numbers on title page. \thispagestyle{empty} \section*{Abstract} With traditional, stateless firewalling (such as ipfwadm or ipchains) there is no need for special HA support in the firewalling subsystem. As long as all packet filtering rules and routing table entries are configured identically, one can use any available tool for IP address takeover to accomplish the goal of failing over from one node to the other. With Linux 2.4/2.6 netfilter/iptables, the Linux firewalling code moves beyond traditional packet filtering. Netfilter provides a modular connection tracking subsystem which can be employed for stateful firewalling.
The connection tracking subsystem gathers information about the state of all current network flows (connections). Packet filtering decisions and NAT information are associated with this state information. In a high availability scenario, this connection tracking state needs to be replicated from the currently active firewall node to all standby slave firewall nodes. Only when all connection tracking state is replicated will the slave node have all necessary state information at the time a failover event occurs. Due to funding by Astaro AG, the netfilter/iptables project now offers a \ident{ct_sync} kernel module for replicating connection tracking state across multiple nodes. The presentation will cover the architectural design and implementation of the connection tracking failover system. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% BODY OF PAPER GOES HERE %%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Failover of stateless firewalls} No special precautions are needed when installing a highly available stateless packet filter. Since no state is kept, all information needed for filtering is the ruleset and the individual, separate packets. Building a set of highly available stateless packet filters can thus be achieved by using any traditional means of IP address takeover, such as Heartbeat or VRRPd. The only remaining issue is to make sure the firewalling ruleset is exactly the same on both machines. This should be ensured by the firewall administrator every time he updates the ruleset and can optionally be managed by some scripts utilizing scp or rsync. If this is not applicable because a very dynamic ruleset is employed, one can build a very simple solution using the iptables-supplied tools \ident{iptables-save} and \ident{iptables-restore}. The output of \ident{iptables-save} can be piped over ssh to \ident{iptables-restore} on a different host, e.g.\ \code{iptables-save | ssh backup-fw iptables-restore} (the hostname being a placeholder for the standby node).
Limitations of this approach: \begin{itemize} \item no state tracking \item not possible in combination with iptables stateful NAT \item no consistency of per-rule packet/byte counters \end{itemize} \section{Failover of stateful firewalls} Modern firewalls implement state tracking (a.k.a.\ connection tracking) in order to keep some state about the currently active sessions. The amount of per-connection state kept at the firewall depends on the particular configuration and networking protocols used. As soon as \textit{any} state is kept at the packet filter, this state information needs to be replicated to the slave/backup nodes within the failover setup. Since Linux 2.4.x, all relevant state is kept within the \textit{connection tracking subsystem}. In order to understand how this state could possibly be replicated, we need to understand the architecture of this conntrack subsystem. \subsection{Architecture of the Linux Connection Tracking Subsystem} Connection tracking within Linux is implemented as a netfilter module called \ident{ip_conntrack.o} (\ident{ip_conntrack.ko} in 2.6.x kernels). Before describing the connection tracking subsystem, we need to introduce a couple of definitions and primitives used throughout the conntrack code. A connection is represented within the conntrack subsystem by a \brcode{struct ip_conntrack}, also called a \textit{connection tracking entry}. Connection tracking relies on \textit{conntrack tuples}, which consist of \begin{itemize} \item source IP address \item source port (or icmp type/code, gre key, ...) \item destination IP address \item destination port \item layer 4 protocol number \end{itemize} A connection is uniquely identified by two tuples: the tuple in the original direction (\lident{IP_CT_DIR_ORIGINAL}) and the tuple for the reply direction (\lident{IP_CT_DIR_REPLY}). Connection tracking itself does not drop packets\footnote{Well, in some rare cases in combination with NAT it needs to drop.
But don't tell anyone, this is secret.} or impose any policy. It just associates every packet with a connection tracking entry, which in turn has a particular state. All other kernel code can use this state information\footnote{State information is referenced via the \brcode{struct sk_buff.nfct} structure member of a packet.}. \subsubsection{Integration of conntrack with netfilter} If the \ident{ip_conntrack.[k]o} module is registered with netfilter, it attaches to the \lident{NF_IP_PRE_ROUTING}, \lident{NF_IP_POST_ROUTING}, \lident{NF_IP_LOCAL_IN}, and \lident{NF_IP_LOCAL_OUT} hooks. Because forwarded packets are the most common case on firewalls, I will only describe how connection tracking works for forwarded packets. The two relevant hooks for forwarded packets are \lident{NF_IP_PRE_ROUTING} and \lident{NF_IP_POST_ROUTING}. Every time a packet arrives at the \lident{NF_IP_PRE_ROUTING} hook, connection tracking creates a conntrack tuple from the packet. It then compares this tuple to the original and reply tuples of all already-seen connections\footnote{Of course this is not implemented as a linear search over all existing connections.} to find out if this just-arrived packet belongs to any existing connection. If there is no match, a new conntrack table entry (\brcode{struct ip_conntrack}) is created. Let's assume the case where we have no already-existing connections and are starting from scratch: the first packet comes in, we derive the tuple from the packet headers, look it up in the conntrack hash table, and don't find any matching entry. As a result, we create a new \brcode{struct ip_conntrack}. This \brcode{struct ip_conntrack} is filled with all necessary data, like the original and reply tuple of the connection. How do we know the reply tuple? By inverting the source and destination parts of the original tuple.\footnote{So why do we need two tuples, if they can be derived from each other?
Wait until we discuss NAT.} Please note that this new \brcode{struct ip_conntrack} is \textbf{not} yet placed into the conntrack hash table. The packet is now passed on to other callback functions which have registered with a lower priority at \lident{NF_IP_PRE_ROUTING}. It then continues traversal of the network stack as usual, including all respective netfilter hooks. If the packet survives (i.e., is not dropped by the routing code, network stack, firewall ruleset, \ldots), it re-appears at \lident{NF_IP_POST_ROUTING}. In this case, we can now safely assume that this packet will be sent off on the outgoing interface, and thus put the connection tracking entry which we created at \lident{NF_IP_PRE_ROUTING} into the conntrack hash table. This process is called \textit{confirming the conntrack}. The connection tracking code itself is not monolithic, but consists of a couple of separate modules\footnote{They don't actually have to be separate kernel modules; e.g.\ TCP, UDP, and ICMP tracking modules are all part of the Linux kernel module \ident{ip_conntrack.o}.}. Besides the conntrack core, there are two important kinds of modules: protocol helpers and application helpers. Protocol helpers implement the layer-4-protocol specific parts. They currently exist for TCP, UDP, and ICMP (an experimental helper for GRE exists). \subsubsection{TCP connection tracking} As TCP is a connection-oriented protocol, it is not very difficult to imagine how connection tracking for this protocol could work. There are well-defined state transitions possible, and conntrack can decide which state transitions are valid within the TCP specification. In reality it's not all that easy, since we cannot assume that all packets that pass the packet filter actually arrive at the receiving end\ldots It is noteworthy that the standard connection tracking code does \textbf{not} do TCP sequence number and window tracking.
A well-maintained patch to add this feature has existed for almost as long as connection tracking itself. It will be integrated with the 2.5.x kernel. The problem with window tracking is its bad interaction with connection pickup. The TCP conntrack code is able to pick up already existing connections, e.g.\ in case your firewall was rebooted. However, connection pickup conflicts with TCP window tracking: the TCP window scaling option is only transferred at connection setup time, and we don't know about it in case of pickup\ldots \subsubsection{ICMP tracking} ICMP is not really a connection-oriented protocol. So how is it possible to do connection tracking for ICMP? The ICMP protocol can be split into two groups of messages: \begin{itemize} \item ICMP error messages, which sort-of belong to a different connection and are therefore associated as \textit{RELATED} to that connection (\lident{ICMP_DEST_UNREACH}, \lident{ICMP_SOURCE_QUENCH}, \lident{ICMP_TIME_EXCEEDED}, \lident{ICMP_PARAMETERPROB}, \lident{ICMP_REDIRECT}). \item ICMP queries, which have a \ident{request-reply} character. The conntrack code lets the request have a state of \textit{NEW}, and the reply \textit{ESTABLISHED}; the reply closes the connection immediately. (\lident{ICMP_ECHO}, \lident{ICMP_TIMESTAMP}, \lident{ICMP_INFO_REQUEST}, \lident{ICMP_ADDRESS}) \end{itemize} \subsubsection{UDP connection tracking} UDP is designed as a connectionless datagram protocol. But most common protocols using UDP as their layer 4 protocol have bi-directional UDP communication. Imagine a DNS query, where the client sends a UDP frame to port 53 of the nameserver, and the nameserver sends back a DNS reply packet from its UDP port 53 to the client. Netfilter treats this as a connection. The first packet (the DNS request) is assigned a state of \textit{NEW}, because the packet is expected to create a new `connection.' The DNS server's reply packet is marked as \textit{ESTABLISHED}.
\subsubsection{conntrack application helpers} More complex application protocols involving multiple connections need special support by a so-called ``conntrack application helper module.'' The stock kernel ships such modules for FTP, IRC (DCC), TFTP, and Amanda. Netfilter CVS currently contains patches for PPTP, H.323, Eggdrop botnet, mms, DirectX, RTSP, and talk/ntalk. We're still lacking a lot of protocols (e.g.\ SIP, SMB/CIFS)---but they are unlikely to appear until somebody really needs them and either develops them on his own or funds development. \subsubsection{Integration of connection tracking with iptables} As stated earlier, conntrack doesn't impose any policy on packets. It just determines the relation of a packet to already existing connections. To base packet filtering decisions on this state information, the iptables \textit{state} match can be used. Every packet falls into one of the following categories: \begin{itemize} \item \textbf{NEW}: the packet would create a new connection, if it survives \item \textbf{ESTABLISHED}: the packet is part of an already established connection (either direction) \item \textbf{RELATED}: the packet is in some way related to an already established connection, e.g.\ ICMP errors or FTP data sessions \item \textbf{INVALID}: conntrack is unable to derive conntrack information from this packet. Please note that all multicast or broadcast packets fall into this category. \end{itemize} \subsection{Poor man's conntrack failover} When thinking about failover of stateful firewalls, one usually thinks about replication of state. This presumes that the state is gathered at one firewalling node (the currently active node) and replicated to several other passive standby nodes. There is, however, a very different approach to replication: concurrent state tracking on all firewalling nodes.
While this scheme has not been implemented within \ident{ct_sync}, the author still thinks it is worth an explanation in this paper. The basic assumption of this approach is: in a setup where all firewalling nodes receive exactly the same traffic, all nodes will deduce the same state information. The viability of this approach depends entirely on the fulfillment of this assumption. \begin{itemize} \item \textit{All packets need to be seen by all nodes}. This is not always true, but can be achieved by using shared media like traditional ethernet (no switches!) and promiscuous mode on all ethernet interfaces. \item \textit{All nodes need to be able to process all packets}. This cannot be universally guaranteed. Even if the hardware (CPU, RAM, chipset, NICs) and software (Linux kernel) are exactly the same, they might behave differently, especially under high load. To avoid those effects, the hardware should be able to deal with far more traffic than seen during normal operation. Also, there should be no userspace processes (like proxies, etc.) running on the firewalling nodes at all. WARNING: Nobody guarantees this behaviour. However, the poor man is usually not interested in scientific proof but in usability in his particular practical setup. \end{itemize} However, even if those conditions are fulfilled, there are remaining issues: \begin{itemize} \item \textit{No resynchronization after reboot}. If a node is rebooted (because of a hardware fault, software bug, software update, etc.), it will lose all state information gathered before the event of the reboot. This means the state information of this node after the reboot will not contain any old state. The effects depend on the traffic. Generally, it is only assured that state information about all connections initiated after the reboot will be present.
If there are short-lived connections (like http), the state information on the just-rebooted node will gradually approximate the state information of an older node. Only after all sessions active at the time of the reboot have terminated is the state information guaranteed to be resynchronized. \item \textit{Only possible with a shared medium}. The practical implication is that no switched ethernet (and thus no full duplex) can be used. \end{itemize} The major advantage of the poor man's approach is implementation simplicity. No state transfer mechanism needs to be developed. Only very few changes to the existing conntrack code would be needed in order to be able to do tracking based on packets received from promiscuous interfaces. The active node would have packet forwarding turned on; the passive nodes, off. I'm not proposing this as a real solution to the failover problem. It's hackish, buggy, and likely to break very easily. But considering it can be implemented in very little programming time, it could be an option for very small installations with modest reliability requirements. \subsection{Conntrack state replication} The preferred solution to the failover problem is, without any doubt, replication of the connection tracking state.
The proposed conntrack state replication solution consists of several parts: \begin{itemize} \item A connection tracking state replication protocol \item An event interface generating event messages as soon as state information changes on the active node \item An interface for explicit generation of connection tracking table entries on the standby slaves \item Some code (preferably a kernel thread) running on the active node, receiving state updates via the event interface and generating conntrack state replication protocol messages \item Some code (preferably a kernel thread) running on the slave node(s), receiving conntrack state replication protocol messages and updating the local conntrack table accordingly \end{itemize} Flow of events in chronological order: \begin{itemize} \item \textit{on the active node, inside the network RX softirq} \begin{itemize} \item \ident{ip_conntrack} analyzes a forwarded packet \item \ident{ip_conntrack} gathers some new state information \item \ident{ip_conntrack} updates the conntrack hash table \item \ident{ip_conntrack} calls the event API \item the function registered with the event API builds and enqueues a message to the send ring \end{itemize} \item \textit{on the active node, inside the conntrack-sync sender kernel thread} \begin{itemize} \item \ident{ct_sync_send} aggregates multiple messages into one packet \item \ident{ct_sync_send} dequeues the packet from the ring \item \ident{ct_sync_send} sends the packet via the in-kernel sockets API \end{itemize} \item \textit{on the slave node(s), inside the network RX softirq} \begin{itemize} \item \ident{ip_conntrack} ignores packets coming from the \ident{ct_sync} interface via the NOTRACK mechanism \item the UDP stack appends the packet to the socket receive queue of the \ident{ct_sync_recv} kernel thread \end{itemize} \item \textit{on the slave node(s), inside the conntrack-sync receive kernel thread} \begin{itemize} \item the \ident{ct_sync_recv} thread receives the state replication packet \item the \ident{ct_sync_recv} thread parses the packet into individual
messages \item the \ident{ct_sync_recv} thread creates/updates the local \ident{ip_conntrack} entry \end{itemize} \end{itemize} \subsubsection{Connection tracking state replication protocol} In order to be able to replicate the state between two or more firewalls, a state replication protocol is needed. This protocol is used over a private network segment shared by all nodes for state replication. It is designed to work over IP unicast and IP multicast transports. IP unicast will be used for direct point-to-point communication between one active firewall and one standby firewall. IP multicast will be used when the state needs to be replicated to more than one standby firewall. The principal design criteria of this protocol are: \begin{itemize} \item \textbf{reliable against data loss}, as the underlying UDP layer only provides checksumming against data corruption, but doesn't employ any means against data loss \item \textbf{lightweight}, since generating the state update messages is already a very expensive process for the sender, eating additional CPU, memory, and IO bandwidth \item \textbf{easy to parse}, to minimize overhead at the receiver(s) \end{itemize} The protocol does not employ any security mechanisms like encryption, authentication, or protection against spoofing attacks. It is assumed that the private conntrack sync network is a secure communications channel, not accessible to any malicious third party. To achieve reliability against data loss, a simple sequence numbering scheme is used. All protocol messages are prefixed by a sequence number determined by the sender. If a slave detects packet loss by discontinuous sequence numbers, it can request the retransmission of the missing packets by stating the missing sequence number(s).
Since there is no acknowledgement for successfully received packets, the sender has to keep a reasonably-sized\footnote{\textit{Reasonably sized} means large enough to cover the round-trip time between the master and the slowest slave.} backlog of recently-sent packets in order to be able to fulfill retransmission requests. The different state replication protocol packet types are: \begin{itemize} \item \textbf{\ident{CT_SYNC_PKT_MASTER_ANNOUNCE}}: A new master announces itself. Any still existing master will downgrade itself to slave upon reception of this packet. \item \textbf{\ident{CT_SYNC_PKT_SLAVE_INITSYNC}}: A slave requests initial synchronization from the master (after reboot or loss of sync). \item \textbf{\ident{CT_SYNC_PKT_SYNC}}: A packet containing synchronization data from the master to the slaves. \item \textbf{\ident{CT_SYNC_PKT_NACK}}: A slave indicates packet loss of a particular sequence number. \end{itemize} The messages within a \lident{CT_SYNC_PKT_SYNC} packet always refer to a particular \textit{resource} (currently \lident{CT_SYNC_RES_CONNTRACK} and \lident{CT_SYNC_RES_EXPECT}, although support for the latter has not been fully implemented yet). For every resource, there are several message types. So far, only \lident{CT_SYNC_MSG_UPDATE} and \lident{CT_SYNC_MSG_DELETE} have been implemented. This means a new connection as well as state changes to an existing connection will always be encapsulated in a \lident{CT_SYNC_MSG_UPDATE} message and therefore contain the full conntrack entry. To uniquely identify (and later reference) a conntrack entry, the only unique criterion is used: the \ident{ip_conntrack_tuple}. \subsubsection{\texttt{ct\_sync} sender thread} Maximum care needs to be taken with the implementation of the \ident{ct_sync} sender. The normal workload of the active firewall node is likely to be already very high, so generating and sending the conntrack state replication messages needs to be highly efficient.
It was therefore decided to use a pre-allocated ringbuffer for outbound \ident{ct_sync} packets. New messages are appended to individual buffers in this ring, and pointers into this ring are passed to the in-kernel sockets API to ensure a minimum number of copies and memory allocations. \subsubsection{\texttt{ct\_sync} initsync sender thread} In order to facilitate ongoing state synchronization at the same time as responding to initial sync requests of an individual slave, the sender has a separate kernel thread for initial state synchronization (\ident{ct_sync_initsync}). At the moment it iterates over the state table and transmits packets at a fixed rate of about 1000 packets per second, resulting in about 4000 connections per second and averaging about 1.5 Mbps of consumed bandwidth. The speed of this initial sync should be configurable by the system administrator, especially since there is no flow control mechanism, and the slave node(s) will have to deal with the packets or otherwise lose sync again. This is certainly an area of future improvement and development---but first we want to see practical problems with this primitive scheme. \subsubsection{\texttt{ct\_sync} receiver thread} Implementation of the receiver is very straightforward. For performance reasons, and to facilitate code-reuse, the receiver uses the same pre-allocated ring buffer structure as the sender. Incoming packets are written into ring members and then successively parsed into their individual messages. Apart from dealing with lost packets, the receiver just needs to call the respective conntrack add/modify/delete functions. \subsubsection{Necessary changes within the netfilter conntrack core} To be able to achieve the described conntrack state replication mechanism, the following changes to the conntrack core were implemented: \begin{itemize} \item Ability to exclude certain packets from being tracked.
This was a long-wanted feature on the TODO list of the netfilter project and is implemented by having a ``raw'' table in combination with a ``NOTRACK'' target. \item Ability to register callback functions to be called every time a new conntrack entry is created or an existing entry is modified. This is part of the nfnetlink-ctnetlink patch, since the ctnetlink event interface also uses this API. \item Export of an API to externally add, modify, and remove conntrack entries. \end{itemize} Since the number of changes is very low, their inclusion into the mainline kernel is not a problem and can happen during the 2.6.x stable kernel series. \subsubsection{Layer 2 dropping and \texttt{ct\_sync}} In most cases, netfilter/iptables-based firewalls will not only function as packet filters but also run local processes such as proxies, DNS relays, SMTP relays, etc. In order to minimize failover time, it is helpful if the full startup and configuration of all network interfaces and all of those userspace processes can happen at system bootup time rather than at the instant of a failover. l2drop provides a convenient way to achieve this: it hooks into the layer 2 netfilter hooks (immediately attached to \ident{netif_rx()} and \ident{dev_queue_xmit()}) and blocks all incoming and outgoing network packets at this very low layer. Even kernel-generated messages such as ARP replies, IPv6 neighbour discovery, IGMP, \dots\ are blocked this way. Of course there has to be an exemption for the state synchronization messages themselves. In order to still facilitate remote administration via SSH and other communication between the cluster nodes, the whole network interface used for synchronization is exempted from l2drop. As soon as a node is promoted to master state, l2drop is disabled and the system becomes visible to the network. \subsubsection{Configuration} All configuration happens via module parameters.
\begin{itemize} \item \texttt{syncdev}: Name of the multicast-capable network device used for state synchronization among the nodes \item \texttt{state}: Initial state of the node (0=slave, 1=master) \item \texttt{id}: Unique node ID (0..255) \item \texttt{l2drop}: Enable (1) or disable (0) the l2drop functionality \end{itemize} \subsubsection{Interfacing with the cluster manager} As indicated at the beginning of this paper, \ident{ct_sync} itself does not provide any mechanism to detect an outage of the master node within a cluster. This job is left to cluster manager software running in userspace. Once an outage of the master is detected, the cluster manager needs to elect one of the remaining (slave) nodes to become the new master. On this elected node, the cluster manager will write the ASCII character \texttt{1} into the \ident{/proc/net/ct_sync} file. Reading from this file will return the current state of the local node. \section{Acknowledgements} The author would like to thank his fellow netfilter developers for their help. Particularly important to \ident{ct_sync} is Krisztian KOVACS, who did a proof-of-concept implementation based on my first paper on \ident{ct_sync} at OLS2002. Without the financial support of Astaro AG, I would not have been able to spend any time on \ident{ct_sync} at all. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{document}