First steps towards the next generation netfilter subsystem
Harald Welte <laforge@netfilter.org>
Sep 21, 2005
Until 2.6, every new kernel version came with its own incarnation of a packet
filter: ipfw, ipfwadm, ipchains, iptables. 2.6.x still had iptables. What was
wrong? Or was iptables good enough to last even two generations?
In reality the netfilter project is working on gradually transforming the
existing framework into something new. Some of those changes are transparent to
the user, so they slip into a kernel release almost unnoticed. However, for
expert users and developers those changes are noteworthy anyway.
Some other changes just extend the existing framework, so most users again
won't even notice them - they just don't take advantage of those new features.
The 2.6.14 kernel release will mark a milestone, since it is scheduled to
contain nfnetlink, ctnetlink, nfnetlink_queue and nfnetlink_log - basically a
totally new netlink-based kernel/userspace interface for most parts of the
netfilter subsystem.
nf_conntrack, a generic layer-3 independent connection tracking subsystem,
initially supporting IPv4 and IPv6, is also in the queue of pending patches.
Chances are high that it will be included in the mainline kernel at the time
this paper is presented at Linux Kongress.
Another new subsystem within the framework is the "ipset" filter, basically an
alternative to using iptables in certain areas.
The presentation (but not this paper) will also summarize the results of the
annual netfilter development workshop, which is scheduled just the week before
Linux Kongress.
nfnetlink
In the current (pre-2.6.14) Linux kernel, there is no unified communications
infrastructure used by all parts of the netfilter/iptables subsystem. Some
parameters can be read from /proc, some can be set via sysctl, and some are
module load time parameters. The iptables configuration happens via
get/setsockopt, and the userspace queueing and logging use two separate
(scarce) netlink family numbers.
Most of the network stack is controlled via netlink. Examples are routing
tables, routing policy, interface configuration, traffic control and ipsec.
nfnetlink is the answer for all netfilter-related kernel/userspace interaction.
It provides a thin layer on top of netlink. The nfnetlink code in the kernel
has its userspace counterpart called "libnfnetlink".
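To illustrate what "a thin layer on top of netlink" can mean in practice, the following is a minimal sketch, not the kernel's actual code. It assumes that the 16-bit netlink message type is split into a subsystem id (high byte) and a per-subsystem message number (low byte), followed by a small nfnetlink header. All `demo_` names are hypothetical, the structs are simplified stand-ins, and the subsystem id values are assumptions that should be checked against your kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed subsystem ids, for illustration only. */
enum { DEMO_SUBSYS_CTNETLINK = 1, DEMO_SUBSYS_QUEUE = 3, DEMO_SUBSYS_ULOG = 4 };

/* Simplified stand-in for the generic netlink message header. */
struct demo_nlmsghdr {
    uint32_t len;    /* total message length including headers */
    uint16_t type;   /* subsystem id in high byte, message number in low byte */
    uint16_t flags;
    uint32_t seq;
    uint32_t pid;
};

/* Simplified stand-in for the nfnetlink header that follows it. */
struct demo_nfgenmsg {
    uint8_t  family;   /* address family, e.g. 2 for IPv4 on Linux */
    uint8_t  version;
    uint16_t res_id;   /* resource id, e.g. a queue or group number */
};

/* Combine subsystem id and message number into one netlink type. */
uint16_t demo_nfnl_type(uint8_t subsys, uint8_t msg)
{
    return (uint16_t)(((unsigned)subsys << 8) | msg);
}

/* Extract the subsystem id, used to pick the receiving subsystem. */
uint8_t demo_nfnl_subsys(uint16_t type) { return (uint8_t)(type >> 8); }

/* Extract the per-subsystem message number. */
uint8_t demo_nfnl_msg(uint16_t type) { return (uint8_t)(type & 0xff); }

/* Fill both headers for an empty request addressed to one subsystem. */
void demo_fill(struct demo_nlmsghdr *nlh, struct demo_nfgenmsg *nfg,
               uint8_t subsys, uint8_t msg, uint8_t family)
{
    assert(nlh != NULL && nfg != NULL);
    nlh->len = (uint32_t)(sizeof(*nlh) + sizeof(*nfg));
    nlh->type = demo_nfnl_type(subsys, msg);
    nlh->flags = 0;
    nlh->seq = 0;
    nlh->pid = 0;
    nfg->family = family;
    nfg->version = 0;
    nfg->res_id = 0;
}
```

With such a scheme a single netlink family can serve several netfilter subsystems, since the receiver only needs `demo_nfnl_subsys()` to route a message to the right one.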
conntrack event API
For some applications (such as state replication or flow-based accounting) it
is interesting to learn about conntrack state changes.
The new conntrack event API provides in-kernel notification of conntrack state changes via a standard notifier_chain.
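The notifier chain idea can be sketched in plain userspace C. This is a toy model under stated assumptions, not the kernel's notifier API: every `demo_` name, the event bits, and the connection struct are hypothetical and only show how subscribers register a callback and get called on each event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event bits; the kernel defines its own set. */
enum { DEMO_CT_NEW = 1 << 0, DEMO_CT_DESTROY = 1 << 1 };

/* Hypothetical, heavily reduced connection object. */
struct demo_conn { unsigned long id; };

typedef int (*demo_notifier_fn)(unsigned long events, struct demo_conn *ct);

/* One subscriber in a singly linked chain. */
struct demo_notifier {
    demo_notifier_fn call;
    struct demo_notifier *next;
};

struct demo_notifier *demo_chain = NULL;

/* Add a subscriber at the head of the chain. */
void demo_register(struct demo_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = demo_chain;
    demo_chain = nb;
}

/* Deliver an event to every subscriber; returns how many were called. */
int demo_call_chain(unsigned long events, struct demo_conn *ct)
{
    struct demo_notifier *nb;
    int called = 0;

    for (nb = demo_chain; nb != NULL; nb = nb->next) {
        nb->call(events, ct);
        called++;
    }
    return called;
}

/* Example subscriber: counts destroyed connections, ignores the rest. */
unsigned long demo_destroyed = 0;

int demo_count_destroy(unsigned long events, struct demo_conn *ct)
{
    (void)ct;
    if (events & DEMO_CT_DESTROY)
        demo_destroyed++;
    return 0;
}
```

A state replication or accounting module would play the role of `demo_count_destroy` here, reacting only to the event bits it cares about.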
nfnetlink_conntrack (aka ctnetlink)
nfnetlink_conntrack is a nfnetlink-based interface for reading, dumping and
manipulating connection tracking state from userspace.
The most straightforward application is to obtain a list of currently tracked
connections. In pre-2.6.14 kernels, this can only be done via the ugly
/proc/net/ip_conntrack virtual file. The file-based
access is slow, unreliable, and doesn't allow for efficient
searching.
However, certain monitoring applications or e.g. a NAT-aware identd
implementation need efficient fine-grained access.
Also, the administrator might want to selectively delete connection tracking
entries, or even flush the whole table. In pre-2.6.14, there is no interface
for that apart from the "rmmod ip_conntrack; modprobe ip_conntrack" kludge.
Additional (future) users of ctnetlink are connection tracking helpers in
userspace. Imagine something like a hybrid between transparent proxying and
the current in-kernel helpers: the helper logic runs as userspace code that
cannot crash your kernel, while you still retain the benefits of
e.g. not having to do userspace processing on ftp data (but only control)
packets.
libnfnetlink_conntrack
libnfnetlink_conntrack is the userspace counterpart to nfnetlink_conntrack
inside the kernel. It constructs and parses nfnetlink packets and thus
provides a "function and struct" style C API.
The "conntrack" program
The conntrack command is a userspace program linked against
libnfnetlink_conntrack. It allows commandline-level access to the connection
tracking table.
conntrack supports listing, deleting, updating, flushing and
even creating connection tracking entries. It also allows listing, deleting
and updating of conntrack expectations.
nf_queue
nf_queue is not really something new, but very few people have known about
it until now. The 2.4.x netfilter subsystem first introduced a generic
packet queueing mechanism for asynchronously sending packets to userspace (and
having userspace reinject them or return a verdict). This mechanism is mostly
known as ip_queue, or the QUEUE target.
In reality, ip_queue sits on top of a small layer called nf_queue. nf_queue
allows for one netfilter queue handler per network protocol family. All
netfilter hooks within this protocol family that return the NF_QUEUE verdict
will send the packet to this nf_queue handler.
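The "one queue handler per protocol family" rule can be modelled as a small table of function pointers. The sketch below is a userspace toy, not the kernel implementation: all `demo_` names, the table size, and the choice to drop packets of a family without a handler are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NPROTO   32   /* assumed table size, one slot per family */
#define DEMO_PF_INET   2
#define DEMO_PF_INET6 10

/* Reduced verdict set; the numeric values are illustrative. */
enum { DEMO_DROP = 0, DEMO_ACCEPT = 1, DEMO_QUEUE = 3 };

/* Hypothetical packet: only the fields this sketch needs. */
struct demo_pkt { int pf; int queued; };

typedef int (*demo_queue_fn)(struct demo_pkt *pkt);

/* At most one handler per protocol family. */
demo_queue_fn demo_handlers[DEMO_NPROTO];

/* Register a handler; refuses a second handler for the same family. */
int demo_register_queue_handler(int pf, demo_queue_fn fn)
{
    if (pf < 0 || pf >= DEMO_NPROTO || fn == NULL)
        return -EINVAL;
    if (demo_handlers[pf] != NULL)
        return -EBUSY;
    demo_handlers[pf] = fn;
    return 0;
}

/* Act on a hook's verdict: queue verdicts go to the family's handler. */
int demo_dispatch(struct demo_pkt *pkt, int verdict)
{
    assert(pkt != NULL);
    if (verdict != DEMO_QUEUE)
        return verdict;
    if (pkt->pf < 0 || pkt->pf >= DEMO_NPROTO || demo_handlers[pkt->pf] == NULL)
        return DEMO_DROP;   /* assumption: no handler means the packet is dropped */
    return demo_handlers[pkt->pf](pkt);
}

/* Stand-in for an ip_queue style handler: just marks the packet as queued. */
int demo_ip_queue(struct demo_pkt *pkt)
{
    pkt->queued = 1;
    return DEMO_QUEUE;
}
```

In this model, registering `demo_ip_queue` for `DEMO_PF_INET` only leaves every other family without a queue handler, which mirrors the limitation described for kernels where ip_queue is the sole handler.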
In the existing 2.4.x and pre-2.6.14 code, the mainline kernel only had one
queue handler: ip_queue. This basically means that only IP packets could be
queued for a userspace process.
Outside of the official kernel tree, a "copy+paste" port of ip_queue was made
to IPv6. The netfilter/iptables project has had enough copy+paste style
"ports" due to architectural limitations. Therefore the code was not accepted
into the mainline kernel. Rather, work on a generic replacement was continued.
Which queue handler is to be used for which protocol family can now be
configured via nfnetlink_queue (see below). The current status can also be read
from /proc/net/netfilter/nf_queue.
nfnetlink_queue
nfnetlink_queue is an nfnetlink-based and layer 3 protocol independent
replacement for ip_queue.
It provides all features of ip_queue for packets independent of their protocol.
In addition to mere replication of ip_queue functionality, it fixes the most
fundamental problem with the old ip_queue code: there was only one global
queue, and there could only be one userspace process attached to it.
nfnetlink_queue supports up to 65535 different dynamically-created queues.
Packets can be put into a specific queue by using the NFQUEUE target. For
backwards compatibility, packets coming from the iptables QUEUE target will be
placed in queue number 0.
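One plausible way to carry a 16-bit queue number together with the queue verdict is to pack both into a single 32-bit value. The following sketch assumes such a layout (verdict code in the low 16 bits, queue number in the high 16 bits). The `DEMO_` constants and helper names are hypothetical and may not match the kernel's own macros.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NF_QUEUE      3u           /* assumed numeric value of the queue verdict */
#define DEMO_VERDICT_MASK  0x0000ffffu  /* low 16 bits: verdict code */
#define DEMO_VERDICT_QMASK 0xffff0000u  /* high 16 bits: queue number */
#define DEMO_VERDICT_QBITS 16

/* Build a "queue to queue_num" verdict, as an NFQUEUE-like target would. */
uint32_t demo_queue_verdict(uint32_t queue_num)
{
    assert(queue_num <= 0xffffu);
    return ((queue_num << DEMO_VERDICT_QBITS) & DEMO_VERDICT_QMASK)
           | DEMO_NF_QUEUE;
}

/* Extract the plain verdict code. */
uint32_t demo_verdict_code(uint32_t verdict)
{
    return verdict & DEMO_VERDICT_MASK;
}

/* Extract the queue number the packet should be placed in. */
uint32_t demo_verdict_queue(uint32_t verdict)
{
    return verdict >> DEMO_VERDICT_QBITS;
}

/* The legacy QUEUE target carries no queue number, so it maps to queue 0. */
uint32_t demo_legacy_queue_verdict(void)
{
    return demo_queue_verdict(0);
}
```

Under this layout the legacy verdict is bit-for-bit identical to the plain queue verdict, which is one way the backwards compatibility with the QUEUE target could fall out naturally.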
Userspace processes can now also receive additional packet metadata such as the
PHYSINDEV/PHYSOUTDEV devices in case of bridging.
libnfnetlink_queue
The library libnfnetlink_queue is the userspace counterpart to nfnetlink_queue
inside the kernel. It provides an easy-to-use C language interface to
userspace packet queueing.
For legacy applications using libipq, an API-compatible
(but not ABI-compatible) libipq replacement is available together with
libnfnetlink_queue.
nf_log
Traditionally, netfilter itself doesn't provide any packet logging
infrastructure. Only iptables provides the LOG target (for klogd/syslogd
logging). In 2001, the ULOG target was added to support more efficient logging
via a dedicated netlink socket.
When the TCP window tracking code was introduced, the requirement for
logging packets (such as TCP out of window packets) from non-iptables code
became pressing.
Instead of a more generic solution, it was decided to have module load time
parameters (nf_log) decide whether ipt_LOG or ipt_ULOG register as "internal
logging backend" that can be used by conntrack.
In 2.6.14, nf_log became a first-class citizen. This means that the iptables
LOG target doesn't do any direct logging. Instead it registers as an nf_log
backend with the core, and calls the nf_log frontend when it wishes to log a
packet.
The nf_log core can then decide whether to log the packet using the ipt_LOG
provided syslog backend, or via old style ipt_ULOG netlink logging, or the
newly-introduced nfnetlink_log mechanism (see below).
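The frontend/backend split can be pictured as a per-family binding that may be changed at runtime. This is a toy userspace model with hypothetical `demo_` names, not the kernel's nf_log interface. The backend names are used as labels only, and the real backends obviously do more than count packets.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_LOG_NPROTO 32   /* assumed table size, one slot per family */

/* A logging backend, reduced to a name and a packet counter. */
struct demo_logger {
    const char *name;        /* e.g. "ipt_LOG", "ipt_ULOG", "nfnetlink_log" */
    unsigned long packets;   /* packets handed to this backend */
};

/* Currently bound backend per protocol family (NULL: none). */
struct demo_logger *demo_loggers[DEMO_LOG_NPROTO];

/* Bind or rebind a backend to a family; returns the previous binding. */
struct demo_logger *demo_log_bind(int pf, struct demo_logger *lg)
{
    struct demo_logger *old;

    assert(pf >= 0 && pf < DEMO_LOG_NPROTO);
    old = demo_loggers[pf];
    demo_loggers[pf] = lg;
    return old;
}

/* Frontend: any caller (a target, conntrack, ...) logs through this.
 * Returns the name of the backend used, or NULL if none is bound. */
const char *demo_log_packet(int pf)
{
    struct demo_logger *lg;

    if (pf < 0 || pf >= DEMO_LOG_NPROTO)
        return NULL;
    lg = demo_loggers[pf];
    if (lg == NULL)
        return NULL;
    lg->packets++;
    return lg->name;
}
```

The point of the indirection is that the caller of `demo_log_packet()` never needs to know which backend is active, so switching a family from syslog-style logging to netlink-based logging requires no change in the callers.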
Which log handler is to be used for what protocol family can be configured
via nfnetlink (see below). The current status can also be read from
/proc/net/netfilter/nf_log.
nfnetlink_log
nfnetlink_log is to logging what nfnetlink_queue is to queueing. It takes
the ideas of the ipt_ULOG target and reimplements them in a layer 3 protocol
independent fashion, and moves the transport on top of nfnetlink.
ipt_ULOG already allowed for up to 32 logging groups, which seemed to be enough
in all practical cases. To be consistent with nfnetlink_queue,
nfnetlink_log now also supports 65535 logging groups, each of which can be
terminated by a different logging process.
libnfnetlink_log
Analogous to libnfnetlink_queue, libnfnetlink_log is the userspace counterpart
to nfnetlink_log in the kernel.
libnfnetlink_log also provides a libipulog backwards compatibility API.
Flow based accounting
The fundamental idea of flow-based (or more correctly: connection-based)
accounting is to keep per-connection byte and packet counters within the connection tracking table.
On firewall systems that already use ip_conntrack, keeping those per-connection
counters only adds very little overhead to the existing connection tracking,
and is thus almost free.
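Why the overhead is small becomes clearer with a sketch: once a packet has been matched to its connection, accounting is two additions. The code below is a minimal illustration with hypothetical `demo_` names. Keeping separate counters per direction is an assumption of this sketch, and the real conntrack entry contains far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direction of a packet relative to the connection's first packet. */
enum demo_dir { DEMO_DIR_ORIGINAL = 0, DEMO_DIR_REPLY = 1, DEMO_DIR_MAX = 2 };

/* Per-direction counters. */
struct demo_counter {
    uint64_t packets;
    uint64_t bytes;
};

/* Hypothetical conntrack entry, reduced to its accounting part. */
struct demo_conntrack {
    /* tuple, state and timeout would live here in a real entry */
    struct demo_counter acct[DEMO_DIR_MAX];
};

/* Account one packet of len bytes; called after the conntrack lookup. */
void demo_acct(struct demo_conntrack *ct, enum demo_dir dir, size_t len)
{
    assert(ct != NULL);
    assert(dir == DEMO_DIR_ORIGINAL || dir == DEMO_DIR_REPLY);
    ct->acct[dir].packets++;
    ct->acct[dir].bytes += len;
}
```

The expensive part, finding the connection a packet belongs to, is work the firewall already performs, so the counters ride along with an existing lookup.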
Internally, flow-based accounting uses both the conntrack event API and
nfnetlink_conntrack.
For a more detailed description of flow based accounting and the motivations
behind it, please refer to my paper on flow based accounting published in the
proceedings of Linuxtag 2005.
nf_conntrack
nf_conntrack is a generalized version of ip_conntrack. This generalization is
required to provide connection tracking for non-IPv4 protocols. Currently only
IPv4 and IPv6 are supported in nf_conntrack.
The architecture of nf_conntrack is almost exactly the same as that of ip_conntrack,
only
nf_conntrack is not in the 2.6.14 kernel series but will very likely be merged
during the early 2.6.15 development process. The latest nf_conntrack version can be obtained from the netfilter-2.6 git tree.