\documentclass[twocolumn]{article}
\usepackage{ols}
\begin{document}
\date{}

\title{\Large \bf How to replicate the fire - HA for netfilter based firewalls}

\author{
Harald Welte\\
{\em Netfilter Core Team + Astaro AG}\\
{\normalsize laforge@gnumonks.org/laforge@astaro.com, http://www.gnumonks.org/}
}

\maketitle
\thispagestyle{empty}

\subsection*{Abstract}

With traditional, stateless firewalling (such as ipfwadm, ipchains) there is no need for special HA support in the firewalling subsystem. As long as all packet filtering rules and routing table entries are configured in exactly the same way, one can use any available tool for IP address takeover to accomplish the goal of failing over from one node to the other.

With Linux 2.4.x netfilter/iptables, the Linux firewalling code moves beyond traditional packet filtering. Netfilter provides a modular connection tracking subsystem which can be employed for stateful firewalling. The connection tracking subsystem gathers information about the state of all current network flows (connections). Packet filtering decisions and NAT information are associated with this state information.

In a high availability scenario, this connection tracking state needs to be replicated from the currently active firewall node to all standby slave firewall nodes. Only when all connection tracking state is replicated will the slave node have all necessary state information at the time a failover event occurs.

The netfilter/iptables code currently does not have any functionality for replicating connection tracking state across multiple nodes. However, the author of this presentation, Harald Welte, has started a project for connection tracking state replication with netfilter/iptables.

The presentation will cover the architectural design and implementation of the connection tracking failover system. Given the date of the conference, the project is expected to still be a work in progress at that time.

\section{Failover of stateless firewalls}

No special precautions are needed when installing a highly available stateless packet filter. Since no state is kept, all information needed for filtering is the ruleset and the individual, separate packets.

Building a set of highly available stateless packet filters can thus be achieved by using any traditional means of IP address takeover, such as Heartbeat or VRRPd.

The only remaining issue is to make sure the firewalling ruleset is exactly the same on both machines. This should be ensured by the firewall administrator every time he updates the ruleset. If this is not applicable, because a very dynamic ruleset is employed, one can build a very simple solution using the iptables-supplied tools iptables-save and iptables-restore: the output of iptables-save can be piped over ssh to iptables-restore on a different host.

Limitations:
\begin{itemize}
\item no state tracking
\item not possible in combination with NAT
\item no consistency of per-rule packet/byte counters
\end{itemize}

\section{Failover of stateful firewalls}

Modern firewalls implement state tracking (aka connection tracking) in order to keep some state about the currently active sessions. The amount of per-connection state kept at the firewall depends on the particular implementation.

As soon as {\bf any} state is kept at the packet filter, this state information needs to be replicated to the slave/backup nodes within the failover setup.

In Linux 2.4.x, all relevant state is kept within the {\it connection tracking subsystem}.
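To get an impression of what this state looks like, every tracked connection shows up as one line in /proc/net/ip\_conntrack, roughly like the following (wrapped here for readability; addresses are made up and the exact fields depend on protocol and kernel version):

\begin{verbatim}
tcp      6 431999 ESTABLISHED
  src=192.168.1.5 dst=10.0.0.1
  sport=1025 dport=80
  src=10.0.0.1 dst=192.168.1.5
  sport=80 dport=1025
  [ASSURED] use=1
\end{verbatim}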
In order to understand how this state could possibly be replicated, we need to understand the architecture of this conntrack subsystem.

\subsection{Architecture of the Linux Connection Tracking Subsystem}

Connection tracking within Linux is implemented as a netfilter module, called ip\_conntrack.o.

Before describing the connection tracking subsystem itself, we need to introduce a couple of definitions and primitives used throughout the conntrack code.

A connection is represented within the conntrack subsystem using {\it struct ip\_conntrack}, also called a {\it connection tracking entry}.

Connection tracking uses {\it conntrack tuples}, which are tuples consisting of (srcip, srcport, dstip, dstport, l4prot). A connection is uniquely identified by two tuples: the tuple in the original direction (IP\_CT\_DIR\_ORIGINAL) and the tuple for the reply direction (IP\_CT\_DIR\_REPLY).

Connection tracking itself does not drop packets\footnote{Well, in some rare cases in combination with NAT it needs to drop. But don't tell anyone, this is secret.} or impose any policy. It just associates every packet with a connection tracking entry, which in turn has a particular state. All other kernel code can use this state information\footnote{State information is internally attached to a packet via the {\it nfct} member of {\it struct sk\_buff}.}.

\subsubsection{Integration of conntrack with netfilter}

When the ip\_conntrack.o module is registered with netfilter, it attaches to the NF\_IP\_PRE\_ROUTING, NF\_IP\_POST\_ROUTING, NF\_IP\_LOCAL\_IN and NF\_IP\_LOCAL\_OUT hooks. Because forwarded packets are the most common case on firewalls, I will only describe how connection tracking works for forwarded packets. The two relevant hooks for forwarded packets are NF\_IP\_PRE\_ROUTING and NF\_IP\_POST\_ROUTING.

Every time a packet arrives at the NF\_IP\_PRE\_ROUTING hook, connection tracking creates a conntrack tuple from the packet. It then compares this tuple to the original and reply tuples of all already-seen connections\footnote{Of course this is not implemented as a linear search over all existing connections.} to find out if this just-arrived packet belongs to any existing connection. If there is no match, a new conntrack table entry (struct ip\_conntrack) is created.

Let's assume the case where we do not have any existing connections and are starting from scratch. The first packet comes in, we derive the tuple from the packet headers, look it up in the conntrack hash table and do not find any matching entry. As a result, we create a new struct ip\_conntrack. This struct ip\_conntrack is filled with all necessary data, like the original and reply tuple of the connection. How do we know the reply tuple? By inverting the source and destination parts of the original tuple.\footnote{So why do we need two tuples, if they can be derived from each other? Wait until we discuss NAT.} Please note that this new struct ip\_conntrack is {\bf not} yet placed into the conntrack hash table.

The packet is now passed on to other callback functions which have registered with a lower priority at NF\_IP\_PRE\_ROUTING. It then continues traversal of the network stack as usual, including all respective netfilter hooks. If the packet survives (i.e. is not dropped by the routing code, network stack, firewall ruleset, ...), it re-appears at NF\_IP\_POST\_ROUTING.
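Before following the packet further, the tuple representation used above can be pictured with the following simplified sketch. It is illustrative only and not the actual kernel definitions:

\begin{verbatim}
/* Simplified illustration of a conntrack
 * tuple.  The real definitions use unions
 * so that ICMP type/code/id and other
 * protocols fit into the same structure. */
struct example_tuple {
    u_int32_t src_ip, dst_ip;
    u_int16_t src_port, dst_port;
    u_int8_t  l4proto;
};

/* A connection is identified by two such
 * tuples, one per direction: */
struct example_conntrack {
    struct example_tuple dir[2];
        /* [IP_CT_DIR_ORIGINAL],
           [IP_CT_DIR_REPLY] */
    /* plus timeout, state, helper and
     * NAT information ... */
};
\end{verbatim}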
Once the packet re-appears at NF\_IP\_POST\_ROUTING, we can safely assume that it will be sent off on the outgoing interface, and we therefore put the connection tracking entry created at NF\_IP\_PRE\_ROUTING into the conntrack hash table. This process is called {\it confirming the conntrack}.

The connection tracking code itself is not monolithic, but consists of a couple of separate modules\footnote{They don't actually have to be separate kernel modules; e.g. the TCP, UDP and ICMP tracking modules are all part of the Linux kernel module ip\_conntrack.o.}. Besides the conntrack core, there are two important kinds of modules: protocol helpers and application helpers.

Protocol helpers implement the layer-4-protocol specific parts. They currently exist for TCP, UDP and ICMP (an experimental helper for GRE exists).

\subsubsection{TCP connection tracking}

As TCP is a connection-oriented protocol, it is not very difficult to imagine how connection tracking for this protocol could work. There is a well-defined set of possible state transitions, and conntrack can decide which transitions are valid according to the TCP specification. In reality it is not all that easy, since we cannot assume that all packets that pass the packet filter actually arrive at the receiving end, ...

It is noteworthy that the standard connection tracking code does {\bf not} do TCP sequence number and window tracking. A well-maintained patch to add this feature has existed almost as long as connection tracking itself. It will be integrated with the 2.5.x kernel.

The problem with window tracking is its bad interaction with connection pickup. The TCP conntrack code is able to pick up already existing connections, e.g. in case your firewall was rebooted. However, connection pickup conflicts with TCP window tracking: the TCP window scaling option is only transferred at connection setup time, and we don't know about it in case of pickup...

\subsubsection{ICMP tracking}

ICMP is not really a connection-oriented protocol. So how is it possible to do connection tracking for ICMP?

The ICMP protocol can be split into two groups of messages:
\begin{itemize}
\item ICMP error messages, which sort-of belong to a different connection: they are associated as {\it RELATED} to the connection they refer to (ICMP\_DEST\_UNREACH, ICMP\_SOURCE\_QUENCH, ICMP\_TIME\_EXCEEDED, ICMP\_PARAMETERPROB, ICMP\_REDIRECT).
\item ICMP queries, which have a request-reply character. The conntrack code assigns the request a state of {\it NEW} and the reply {\it ESTABLISHED}; the reply closes the connection immediately (ICMP\_ECHO, ICMP\_TIMESTAMP, ICMP\_INFO\_REQUEST, ICMP\_ADDRESS).
\end{itemize}

\subsubsection{UDP connection tracking}

UDP is designed as a connectionless datagram protocol. However, most common protocols using UDP as their layer 4 protocol have bi-directional UDP communication. Imagine a DNS query, where the client sends a UDP frame to port 53 of the nameserver, and the nameserver sends back a DNS reply packet from its UDP port 53 to the client. Netfilter treats this as a connection. The first packet (the DNS request) is assigned a state of {\it NEW}, because the packet is expected to create a new `connection'. The DNS server's reply packet is marked as {\it ESTABLISHED}.

\subsubsection{conntrack application helpers}

More complex application protocols involving multiple connections need special support by a so-called ``conntrack application helper module''. The stock kernel comes with helper modules for FTP and IRC (DCC).
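As an illustration of why such helpers are needed, consider active FTP; addresses and port numbers in this example are made up:

\begin{verbatim}
/* Control connection, tracked as usual:
 *   client 192.168.1.5:1026
 *       -> server 10.0.0.1:21
 *
 * On this connection the client sends
 * "PORT 192,168,1,5,4,3", announcing that
 * it listens on 192.168.1.5:1027 (4*256+3).
 * The FTP helper parses this command and
 * registers an expectation, so that the
 * incoming data connection
 *   server 10.0.0.1:20
 *       -> client 192.168.1.5:1027
 * is classified RELATED instead of NEW. */
\end{verbatim}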
Beyond the helpers in the stock kernel, netfilter CVS currently contains patches for PPTP, H.323, Eggdrop botnet, tftp and talk. We're still lacking a lot of protocols (e.g. SIP, SMB/CIFS) - but they are unlikely to appear until somebody really needs them and either develops them on his own or funds development.

\subsubsection{Integration of connection tracking with iptables}

As stated earlier, conntrack doesn't impose any policy on packets. It just determines the relation of a packet to already existing connections. To base packet filtering decisions on this state information, the iptables {\it state} match can be used. Every packet falls into one of the following categories:
\begin{itemize}
\item {\bf NEW}: packet would create a new connection, if it survives
\item {\bf ESTABLISHED}: packet is part of an already established connection (either direction)
\item {\bf RELATED}: packet is in some way related to an already established connection, e.g. ICMP errors or FTP data sessions
\item {\bf INVALID}: conntrack is unable to derive conntrack information from this packet. Please note that all multicast or broadcast packets fall into this category.
\end{itemize}

\subsection{Poor man's conntrack failover}

When thinking about failover of stateful firewalls, one usually thinks about replication of state. This presumes that the state is gathered at one firewalling node (the currently active node), and replicated to several other passive standby nodes.

There is, however, a very different approach to replication: concurrent state tracking on all firewalling nodes. The basic assumption of this approach is: in a setup where all firewalling nodes receive exactly the same traffic, all nodes will deduce the same state information. Whether this approach can work depends entirely on that assumption being fulfilled:
\begin{itemize}
\item {\it All packets need to be seen by all nodes}. This is not always true, but can be achieved by using shared media like traditional ethernet (no switches!!) and promiscuous mode on all ethernet interfaces.
\item {\it All nodes need to be able to process all packets}. This cannot be universally guaranteed. Even if the hardware (CPU, RAM, chipset, NICs) and software (Linux kernel) are exactly the same, they might behave differently, especially under high load. To avoid those effects, the hardware should be able to deal with way more traffic than seen during normal operation. Also, there should be no userspace processes (like proxies, etc.) running on the firewalling nodes at all. WARNING: Nobody guarantees this behaviour. However, the poor man is usually not interested in scientific proof but in usability in his particular practical setup.
\end{itemize}

However, even if those conditions are fulfilled, there are remaining issues:
\begin{itemize}
\item {\it No resynchronization after reboot}. If a node is rebooted (because of a hardware fault, software bug, software update, ...), it will lose all state information gathered before the reboot. This means that after the reboot, the node's state information will not contain any of the old state. The effect depends on the traffic: generally, it is only assured that state information about all connections initiated after the reboot will be present. If connections are short-lived (like http), the state information on the just-rebooted node will gradually approximate that of an older node. Only after all sessions that were active at the time of the reboot have terminated is the state information guaranteed to be resynchronized.
\item {\it Only possible with shared medium}. The practical implication is that no switched ethernet (and thus no full duplex) can be used.
\end{itemize}

The major advantage of the poor man's approach is implementation simplicity. No state transfer mechanism needs to be developed. Only very small changes to the existing conntrack code would be needed in order to be able to do tracking based on packets received from promiscuous interfaces. The active node would have packet forwarding turned on, the passive nodes off.

I'm not proposing this as a real solution to the failover problem. It's hackish, buggy and likely to break very easily. But considering it can be implemented in very little programming time, it could be an option for very small installations with low reliability criteria.

\subsection{Conntrack state replication}

The preferred solution to the failover problem is, without any doubt, replication of the connection tracking state.

The proposed conntrack state replication solution consists of several parts:
\begin{itemize}
\item A connection tracking state replication protocol
\item An event interface generating event messages as soon as state information changes on the active node
\item An interface for explicit generation of connection tracking table entries on the standby slaves
\item Some code (preferably a kernel thread) running on the active node, receiving state updates via the event interface and generating conntrack state replication protocol messages
\item Some code (preferably a kernel thread) running on the slave node(s), receiving conntrack state replication protocol messages and updating the local conntrack table accordingly
\end{itemize}

The flow of events in chronological order:
\begin{itemize}
\item {\it on the active node, inside the network RX softirq}
\begin{itemize}
\item the connection tracking code analyzes a forwarded packet
\item connection tracking gathers some new state information
\item connection tracking updates the local connection tracking database
\item connection tracking sends an event message to the event API
\end{itemize}
\item {\it on the active node, inside the conntrack-sync kernel thread}
\begin{itemize}
\item the conntrack sync daemon receives the event through the event API
\item the conntrack sync daemon aggregates multiple event messages, removing possible redundancy
\item the conntrack sync daemon generates a state replication protocol message
\item the conntrack sync daemon sends the state replication protocol message over the private network between the firewall nodes
\end{itemize}
\item {\it on the slave node(s), inside the network RX softirq}
\begin{itemize}
\item the connection tracking code ignores packets coming from the interface attached to the private conntrack sync network
\item state replication protocol messages are appended to the socket receive queue of the conntrack-sync kernel thread
\end{itemize}
\item {\it on the slave node(s), inside the conntrack-sync kernel thread}
\begin{itemize}
\item the conntrack sync daemon receives the state replication message
\item the conntrack sync daemon creates/updates the corresponding conntrack entry
\end{itemize}
\end{itemize}

\subsubsection{Connection tracking state replication protocol}

In order to be able to replicate the state between two or more firewalls, a state replication protocol is needed. This protocol is used over a private network segment shared by all nodes for state replication. It is designed to work over IP unicast and IP multicast transport.
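As a rough illustration of what such a protocol could look like on the wire, the following sketch shows a hypothetical message header. All names and field layouts are made up for this illustration; the message types and the sequence numbering scheme are described below:

\begin{verbatim}
/* Hypothetical header of a state replication
 * message -- illustrative only, not an
 * existing netfilter structure. */
struct ctsrp_msg_hdr {
    u_int32_t seq;   /* sender-assigned
                      * sequence number   */
    u_int16_t type;  /* NEW/UPDATE/EXPIRE */
    u_int16_t len;   /* payload length    */
    /* payload: conntrack_id plus the
     * tuples/state of the affected entry */
};
\end{verbatim}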
IP unicast will be used for direct point-to-point communication between one active firewall and one standby firewall. IP multicast will be used when the state needs to be replicated to more than one standby firewall.

The principal design criteria of this protocol are:
\begin{itemize}
\item {\bf reliable against data loss}, as the underlying UDP layer only provides checksumming against data corruption, but doesn't employ any means against data loss
\item {\bf lightweight}, since generating the state update messages is already a very expensive process for the sender, eating additional CPU, memory and IO bandwidth
\item {\bf easy to parse}, to minimize overhead at the receiver(s)
\end{itemize}

The protocol does not employ any security mechanisms like encryption, authentication or protection against spoofing attacks. It is assumed that the private conntrack sync network is a secure communications channel, not accessible to any malicious 3rd party.

To achieve reliability against data loss, a simple sequence numbering scheme is used. All protocol messages are prefixed by a sequence number, determined by the sender. If the slave detects packet loss by discontinuous sequence numbers, it can request the retransmission of the missing packets by stating the missing sequence number(s). Since there is no acknowledgement for successfully received packets, the sender has to keep a reasonably-sized backlog of recently-sent packets in order to be able to fulfill retransmission requests.

The different state replication protocol message types are:
\begin{itemize}
\item {\bf NF\_CTSRP\_NEW}: a new conntrack entry has been created (and confirmed\footnote{See the above description of the conntrack code for what is meant by {\it confirming} a conntrack entry.})
\item {\bf NF\_CTSRP\_UPDATE}: the state information of an existing conntrack entry has changed
\item {\bf NF\_CTSRP\_EXPIRE}: an existing conntrack entry has expired
\end{itemize}

To uniquely identify (and later reference) a conntrack entry, a {\it conntrack\_id} is assigned to every conntrack entry transferred using a NF\_CTSRP\_NEW message. This conntrack\_id must be saved at the receiver(s) together with the conntrack entry, since it is used by the sender for subsequent NF\_CTSRP\_UPDATE and NF\_CTSRP\_EXPIRE messages. The protocol itself does not care about the source of this conntrack\_id, but since the current netfilter connection tracking implementation never changes the address of a conntrack entry, the memory address of the entry can be used, as it comes for free.

\subsubsection{Connection tracking state synchronization sender}

Maximum care needs to be taken for the implementation of the ctsyncd sender. The normal workload of the active firewall node is likely to be already very high, so generating and sending the conntrack state replication messages needs to be highly efficient.
\begin{itemize}
\item {\bf NF\_CTSRP\_NEW} will be generated at the NF\_IP\_POST\_ROUTING hook, at the time ip\_conntrack\_confirm() is called. Delaying this message until conntrack confirmation happens saves us from replicating otherwise unneeded state information.
\item {\bf NF\_CTSRP\_UPDATE} needs to be created automagically by the conntrack core. It is not possible to have any failover-specific code within conntrack protocol and/or application helpers.
The easiest way involving the least changes to the conntrack core code is to copy parts of the conntrack entry before calling any helper functions, and then use memcmp() to find out if the helper has changed any information.
\item {\bf NF\_CTSRP\_EXPIRE} can be added very easily to the existing conntrack destroy function.
\end{itemize}

\subsubsection{Connection tracking state synchronization receiver}

Implementation of the receiver is very straightforward. Apart from dealing with lost CTSRP packets, it just needs to call the respective conntrack add/modify/delete functions offered by the core.

\subsubsection{Necessary changes within the netfilter conntrack core}

To be able to implement the described conntrack state replication mechanism, the following changes to the conntrack core are needed:
\begin{itemize}
\item The ability to exclude certain packets from being tracked. This is a long-wanted feature on the TODO list of the netfilter project and will be implemented by having a ``prestate'' table in combination with a ``NOTRACK'' target.
\item The ability to register callback functions to be called every time a new conntrack entry is created or an existing entry is modified.
\item An exported API to externally add, modify and remove conntrack entries. Since the needed ip\_conntrack\_lock is already exported, the implementation could even reside outside the conntrack core code.
\end{itemize}

Since the number of changes is very low, it is very likely that the modifications will go into the mainstream kernel without any big hassle.
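As a closing illustration, the callback registration and external manipulation interface described above could look roughly like the following sketch. This is purely hypothetical; no such API exists in the stock kernel yet, and all names are made up:

\begin{verbatim}
/* Hypothetical notifier for conntrack state
 * changes -- a sketch of the proposed core
 * change, not an existing kernel API. */
struct ip_conntrack_notifier {
    struct list_head list;
    void (*created)(struct ip_conntrack *);
    void (*changed)(struct ip_conntrack *);
    void (*destroyed)(struct ip_conntrack *);
};

int ip_conntrack_notifier_register(
    struct ip_conntrack_notifier *n);
int ip_conntrack_notifier_unregister(
    struct ip_conntrack_notifier *n);

/* Hypothetical external manipulation calls,
 * used by the receiver when replaying CTSRP
 * messages on a slave node: */
int ip_conntrack_inject(
    const struct ip_conntrack *ct);
int ip_conntrack_update(
    const struct ip_conntrack *ct);
int ip_conntrack_destroy_by_id(u_int32_t id);
\end{verbatim}

\end{document}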