First steps towards the next generation netfilter subsystem

Harald Welte <laforge@netfilter.org>
Sep 21, 2005

Until 2.6, every new kernel version came with its own incarnation of a packet filter: ipfw, ipfwadm, ipchains, iptables. 2.6.x still had iptables. What was wrong? Or was iptables good enough to last even two generations?

In reality the netfilter project is working on gradually transforming the existing framework into something new. Some of those changes are transparent to the user, so they slip into a kernel release almost unnoticed. However, for expert users and developers those changes are noteworthy anyway. Some other changes just extend the existing framework, so most users again won't even notice them - they just don't take advantage of those new features.

The 2.6.14 kernel release will mark a milestone, since it is scheduled to contain nfnetlink, ctnetlink, nfnetlink_queue and nfnetlink_log - basically a totally new netlink-based kernel/userspace interface for most parts of the netfilter subsystem.

nf_conntrack, a generic layer-3 independent connection tracking subsystem, initially supporting IPv4 and IPv6, is also in the queue of pending patches. Chances are high that it will be included in the mainline kernel at the time this paper is presented at Linux Kongress.

Another new subsystem within the framework is the "ipset" filter, basically an alternative to using iptables in certain areas.

The presentation (but not this paper) will also summarize the results of the annual netfilter development workshop, which is scheduled just the week before Linux Kongress.
nfnetlink

In the current (pre-2.6.14) Linux kernel, there is no unified communications infrastructure used by all parts of the netfilter/iptables subsystem. Some parameters can be read from /proc, some can be set via sysctl, and some are module load time parameters. The iptables configuration happens via get/setsockopt, and userspace queueing and logging use two separate (scarce) netlink family numbers.

Most of the rest of the network stack is already controlled via netlink. Examples are routing tables, routing policy, interface configuration, traffic control and IPsec.

nfnetlink is the answer for all netfilter-related kernel/userspace interaction. It provides a thin layer on top of netlink. The nfnetlink code in the kernel has its userspace counterpart called "libnfnetlink".
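To make the "thin layer on top of netlink" idea more concrete, the following is a minimal sketch of how one netlink family can be shared by several netfilter subsystems: the 16-bit netlink message type is split into a subsystem identifier and a per-subsystem message number. All `demo_` names are hypothetical. The subsystem numbers mirror the values found in the `nfnetlink.h` header of later mainline kernels and should be treated as an assumption for any particular kernel version.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed subsystem identifiers (mirroring later mainline nfnetlink.h). */
enum {
    DEMO_SUBSYS_CTNETLINK = 1,  /* connection tracking */
    DEMO_SUBSYS_QUEUE     = 3,  /* userspace packet queueing */
    DEMO_SUBSYS_ULOG      = 4   /* userspace packet logging */
};

/* Pack subsystem (high byte) and message number (low byte) into one type. */
uint16_t demo_nfnl_type(unsigned subsys, unsigned msg)
{
    assert(subsys <= 0xff && msg <= 0xff);
    return (uint16_t)((subsys << 8) | msg);
}

/* Recover the subsystem a received message belongs to. */
uint8_t demo_nfnl_subsys(uint16_t type)
{
    return (uint8_t)(type >> 8);
}

/* Recover the subsystem-specific message number. */
uint8_t demo_nfnl_msg(uint16_t type)
{
    return (uint8_t)(type & 0xff);
}
```

With this split, a receiver can dispatch on `demo_nfnl_subsys()` first and leave the interpretation of the low byte to the subsystem that owns it, which is why a single netlink family number is sufficient.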
conntrack event API

For some applications (such as state replication or flow-based accounting) it is interesting to learn about conntrack state changes. The new conntrack event API provides in-kernel notification of conntrack state changes via a standard notifier_chain.
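The notifier pattern behind such an event API can be sketched in plain userspace C. This is a simulation under stated assumptions, not kernel code: the `demo_` types, the event names and the bitmask encoding are hypothetical, and the real kernel notifier chain additionally handles priorities and locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event bits; the real event set is defined by the kernel. */
enum demo_ct_event { DEMO_CT_NEW = 1, DEMO_CT_UPDATE = 2, DEMO_CT_DESTROY = 4 };

struct demo_notifier {
    int (*call)(struct demo_notifier *self, unsigned long event, void *data);
    struct demo_notifier *next;
};

struct demo_chain { struct demo_notifier *head; };

/* Add a subscriber to the front of the chain. */
void demo_chain_register(struct demo_chain *c, struct demo_notifier *n)
{
    assert(c != NULL && n != NULL && n->call != NULL);
    n->next = c->head;
    c->head = n;
}

/* Deliver one event to every subscriber; returns how many were called. */
int demo_chain_call(struct demo_chain *c, unsigned long event, void *data)
{
    int called = 0;
    for (struct demo_notifier *n = c->head; n != NULL; n = n->next) {
        n->call(n, event, data);
        called++;
    }
    return called;
}

/* Example subscriber: counts destroyed connections, as accounting might. */
struct demo_destroy_counter {
    struct demo_notifier nb;   /* must stay the first member */
    unsigned destroyed;
};

int demo_count_destroy(struct demo_notifier *self, unsigned long event, void *data)
{
    struct demo_destroy_counter *c = (struct demo_destroy_counter *)self;
    (void)data;
    if (event & DEMO_CT_DESTROY)
        c->destroyed++;
    return 0;
}
```

A subscriber such as `demo_destroy_counter` is registered once and is then called back for every state change, so it never has to poll the connection table.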
nfnetlink_conntrack (aka ctnetlink)

nfnetlink_conntrack is an nfnetlink-based interface for reading, dumping and manipulating connection tracking state from userspace.

The most straightforward application is to obtain a list of currently tracked connections. In pre-2.6.14 kernels, this can only be done via the ugly /proc/net/ip_conntrack virtual file. The file-based access is slow, unreliable, suboptimal and doesn't allow for efficient searching. However, certain monitoring applications, or e.g. a NAT-aware identd implementation, need efficient fine-grained access.

Also, the administrator might want to selectively delete connection tracking entries, or even flush the whole table. In pre-2.6.14 kernels, there is no interface for that apart from the "rmmod ip_conntrack; modprobe ip_conntrack" kludge.

Additional (future) users of ctnetlink are connection tracking helpers in userspace. Imagine something like a hybrid between transparent proxying and the current in-kernel helpers: you get the benefit of running the helper as userspace code that cannot crash your kernel, and still retain the benefits of e.g. not having to do userspace processing on FTP data (but only control) packets.
libnfnetlink_conntrack

libnfnetlink_conntrack is the userspace counterpart to nfnetlink_conntrack inside the kernel. It constructs and parses nfnetlink packets and thus provides a "function and struct" style C API.
The "conntrack" program

The conntrack command is a userspace program linked against libnfnetlink_conntrack. It allows command-line access to the connection tracking table.

conntrack supports listing, deleting, updating, flushing and even creating connection tracking entries. It also allows listing, deleting and updating of conntrack expectations.
nf_queue

nf_queue is not really something new, but very few people have known about it until now. The 2.4.x netfilter subsystem first introduced a generic packet queueing mechanism for asynchronously sending packets to userspace (and reinjecting them or issuing a verdict). This mechanism is mostly known as ip_queue, or the QUEUE target.

In reality, ip_queue sits on top of a small layer called nf_queue. nf_queue allows for one netfilter queue handler per network protocol family. All netfilter hooks within this protocol family that return the NF_QUEUE verdict will send the packet to this nf_queue handler.

In the existing 2.4.x and pre-2.6.14 code, the mainline kernel only had one queue handler: ip_queue. This basically means that only IPv4 packets could be queued for a userspace process.

Outside of the official kernel tree, a "copy+paste" port of ip_queue was made to IPv6. The netfilter/iptables project has had enough copy+paste style "ports" due to architectural limitations. Therefore the code was not accepted into the mainline kernel. Rather, work on a generic replacement was continued.

Which queue handler is to be used for what protocol family can now be configured via nfnetlink_queue (see below). The current status can also be read from /proc/net/netfilter/nf_queue.
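The "one queue handler per protocol family" rule can be illustrated with a small userspace simulation. All `demo_` names, the table size and the return codes are hypothetical. The sketch assumes that a second registration for an already occupied family is refused, and that a packet with an NF_QUEUE verdict but no registered handler is dropped.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NPROTO 32                      /* hypothetical table size */
enum { DEMO_PF_INET = 2, DEMO_PF_INET6 = 10 };
enum { DEMO_DROPPED = 0, DEMO_QUEUED = 1 };

typedef int (*demo_queue_fn)(const void *pkt, size_t len);

/* One handler slot per protocol family. */
static demo_queue_fn demo_handlers[DEMO_NPROTO];

/* Claim the slot of one protocol family; only one handler may own it. */
int demo_register_queue_handler(int pf, demo_queue_fn fn)
{
    if (pf < 0 || pf >= DEMO_NPROTO || fn == NULL)
        return -EINVAL;
    if (demo_handlers[pf] != NULL)
        return -EBUSY;
    demo_handlers[pf] = fn;
    return 0;
}

/* Called when a hook of family pf returned the queue verdict. */
int demo_nf_queue(int pf, const void *pkt, size_t len)
{
    assert(pf >= 0 && pf < DEMO_NPROTO);
    if (demo_handlers[pf] == NULL)
        return DEMO_DROPPED;                /* nobody to queue to */
    return demo_handlers[pf](pkt, len);
}

/* Stand-in for an IPv4-only handler such as ip_queue. */
int demo_ipv4_handler(const void *pkt, size_t len)
{
    (void)pkt;
    (void)len;
    return DEMO_QUEUED;
}
```

With only `demo_ipv4_handler` registered for `DEMO_PF_INET`, IPv6 packets have nowhere to go, which mirrors the limitation described above.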
nfnetlink_queue

nfnetlink_queue is an nfnetlink-based and layer-3 protocol independent replacement of ip_queue. It provides all features of ip_queue for packets independent of their protocol.

In addition to mere replication of ip_queue functionality, it fixes the most fundamental problem with the old ip_queue code: there was only one global queue, and only one userspace process could be attached to it.

nfnetlink_queue supports up to 65535 different dynamically-created queues. Packets can be put into a specific queue by using the NFQUEUE target. For backwards compatibility, packets coming from the iptables QUEUE target will be placed in queue number 0.

Userspace processes can now also receive additional packet metadata, such as the PHYSINDEV/PHYSOUTDEV devices in case of bridging.
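The relationship between queue numbers, the legacy QUEUE target and the attached userspace processes can be sketched as follows. This is a userspace simulation with hypothetical `demo_` names. The small fixed-size table stands in for whatever data structure the kernel actually uses, and process IDs stand in for the netlink peers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_QUEUES   8   /* hypothetical limit of this sketch only */
#define DEMO_LEGACY_QUEUE 0   /* queue used by the old QUEUE target */

struct demo_queue {
    int      in_use;
    uint16_t num;         /* 16-bit queue number */
    int      owner_pid;   /* the one process attached to this queue */
    unsigned queued;      /* packets handed to that process so far */
};

static struct demo_queue demo_queues[DEMO_MAX_QUEUES];

static struct demo_queue *demo_find_queue(uint16_t num)
{
    for (size_t i = 0; i < DEMO_MAX_QUEUES; i++)
        if (demo_queues[i].in_use && demo_queues[i].num == num)
            return &demo_queues[i];
    return NULL;
}

/* Create queue num on demand and attach process pid to it. */
int demo_bind_queue(uint16_t num, int pid)
{
    assert(pid > 0);
    if (demo_find_queue(num) != NULL)
        return -1;                          /* already owned */
    for (size_t i = 0; i < DEMO_MAX_QUEUES; i++) {
        if (!demo_queues[i].in_use) {
            demo_queues[i] = (struct demo_queue){ 1, num, pid, 0 };
            return 0;
        }
    }
    return -1;                              /* table full */
}

/* Returns the pid that receives the packet, or -1 if no such queue. */
int demo_enqueue(uint16_t num)
{
    struct demo_queue *q = demo_find_queue(num);
    if (q == NULL)
        return -1;
    q->queued++;
    return q->owner_pid;
}
```

A rule using the old QUEUE target corresponds to `demo_enqueue(DEMO_LEGACY_QUEUE)`, while an NFQUEUE rule passes its own queue number, so two independent programs can each own a queue without interfering with one another.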
libnfnetlink_queue

The library libnfnetlink_queue is the userspace counterpart to nfnetlink_queue inside the kernel. It provides an easy-to-use C language interface to userspace packet queueing.

For legacy applications using libipq, an API-compatible (but not ABI-compatible) libipq replacement is available together with libnfnetlink_queue.
nf_log

Traditionally, netfilter itself doesn't provide any packet logging infrastructure. Only iptables provides the LOG target (for klogd/syslogd logging). In 2001, the ULOG target was added to support more efficient logging via a dedicated netlink socket.

When the TCP window tracking code was introduced, the requirement for logging packets (such as TCP out-of-window packets) from non-iptables code became pressing. Instead of a more generic solution, it was decided to have module load time parameters (nf_log) decide whether ipt_LOG or ipt_ULOG registers as the "internal logging backend" that can be used by conntrack.

In 2.6.14, nf_log became a first-class citizen. This means that the iptables LOG target doesn't do any direct logging. Instead it registers as an nf_log backend with the core, and calls the nf_log frontend when it wishes to log a packet.

The nf_log core can then decide whether to log the packet using the syslog backend provided by ipt_LOG, via old-style ipt_ULOG netlink logging, or via the newly-introduced nfnetlink_log mechanism (see below).

Which log handler is to be used for what protocol family can be configured via nfnetlink (see below). The current status can also be read from /proc/net/netfilter/nf_log.
nfnetlink_log

nfnetlink_log is for logging what nfnetlink_queue is for queueing. It takes the ideas of the ipt_ULOG target, reimplements them in a layer-3 protocol independent fashion, and moves the transport on top of nfnetlink.

ipt_ULOG already allowed for up to 32 logging groups, which seemed to be enough in all practical cases. To be consistent with nfnetlink_queue, nfnetlink_log now also supports 65535 logging groups, each of which can be terminated by a different logging process.
libnfnetlink_log

Analogous to libnfnetlink_queue, libnfnetlink_log is the userspace counterpart to nfnetlink_log in the kernel. libnfnetlink_log also provides a libipulog backwards compatibility API.
Flow based accounting

The fundamental idea of flow-based (or more correctly: connection-based) accounting is to keep per-connection byte and packet counters within the connection tracking table.

On firewall systems that already use ip_conntrack, keeping those per-connection counters adds very little overhead to the existing connection tracking, and is thus almost free.

Internally, flow-based accounting uses both the conntrack event API and nfnetlink_conntrack.

For a more detailed description of flow-based accounting and the motivations behind it, please refer to my paper on flow-based accounting published in the proceedings of LinuxTag 2005.
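A minimal sketch of such per-connection counters is shown below. The `demo_` names are hypothetical, and keeping one counter pair per direction as well as the 64-bit counter width are assumptions of this sketch rather than a statement about the kernel's actual layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_dir { DEMO_DIR_ORIGINAL = 0, DEMO_DIR_REPLY = 1, DEMO_DIR_MAX = 2 };

struct demo_acct {
    uint64_t packets;
    uint64_t bytes;
};

/* A tracked connection with the counters embedded in its table entry. */
struct demo_conn {
    struct demo_acct acct[DEMO_DIR_MAX];
};

/* Called once per packet that matched the connection. */
void demo_account(struct demo_conn *ct, enum demo_dir dir, uint32_t len)
{
    assert(ct != NULL && (int)dir >= 0 && (int)dir < DEMO_DIR_MAX);
    ct->acct[dir].packets++;
    ct->acct[dir].bytes += len;
}

/* Total traffic of the connection, e.g. reported when it is destroyed. */
uint64_t demo_total_bytes(const struct demo_conn *ct)
{
    return ct->acct[DEMO_DIR_ORIGINAL].bytes + ct->acct[DEMO_DIR_REPLY].bytes;
}
```

Because the connection has already been looked up for tracking purposes, the extra work per packet is two additions, which is why the overhead is described as almost free.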
nf_conntrack

nf_conntrack is a generalized version of ip_conntrack. This generalization is required to provide connection tracking for non-IPv4 protocols. Currently only IPv4 and IPv6 are supported in nf_conntrack.

The architecture of nf_conntrack is almost exactly the same as that of ip_conntrack. nf_conntrack is not in the 2.6.14 kernel series, but will very likely be merged during the early 2.6.15 development process.

The latest nf_conntrack version can be obtained from the netfilter-2.6 git tree.
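One way to picture the generalization is a connection tuple whose address fields are wide enough for any supported layer-3 protocol and are tagged with the protocol family. The sketch below is an illustration under stated assumptions: the `demo_` names and the field layout are hypothetical and are not taken from the nf_conntrack sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_AF_INET = 2, DEMO_AF_INET6 = 10 };

/* Large enough for an IPv6 address; IPv4 uses only the first word. */
union demo_l3addr {
    uint32_t all[4];
    uint32_t ip;
    uint32_t ip6[4];
};

struct demo_tuple {
    uint16_t l3num;       /* layer-3 protocol family */
    uint8_t  protonum;    /* layer-4 protocol, e.g. 6 for TCP */
    union demo_l3addr src, dst;
    uint16_t sport, dport;
};

void demo_tuple_init_v4(struct demo_tuple *t, uint32_t src, uint32_t dst,
                        uint8_t proto, uint16_t sport, uint16_t dport)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));   /* zero the unused address words */
    t->l3num = DEMO_AF_INET;
    t->protonum = proto;
    t->src.ip = src;
    t->dst.ip = dst;
    t->sport = sport;
    t->dport = dport;
}

void demo_tuple_init_v6(struct demo_tuple *t, const uint32_t src[4],
                        const uint32_t dst[4], uint8_t proto,
                        uint16_t sport, uint16_t dport)
{
    assert(t != NULL && src != NULL && dst != NULL);
    memset(t, 0, sizeof(*t));
    t->l3num = DEMO_AF_INET6;
    t->protonum = proto;
    memcpy(t->src.ip6, src, sizeof(t->src.ip6));
    memcpy(t->dst.ip6, dst, sizeof(t->dst.ip6));
    t->sport = sport;
    t->dport = dport;
}

/* A single comparison routine serves both address families. */
int demo_tuple_equal(const struct demo_tuple *a, const struct demo_tuple *b)
{
    return a->l3num == b->l3num && a->protonum == b->protonum &&
           a->sport == b->sport && a->dport == b->dport &&
           memcmp(a->src.all, b->src.all, sizeof(a->src.all)) == 0 &&
           memcmp(a->dst.all, b->dst.all, sizeof(a->dst.all)) == 0;
}
```

The lookup and comparison code never needs to know which family it is handling, so the rest of the tracking logic can remain nearly unchanged when a new layer-3 protocol is added.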