<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE article PUBLIC '-//OASIS//DTD DocBook XML V4.3//EN' 'http://www.docbook.org/xml/4.3/docbookx.dtd'>

<article id="rfid_introduction-ds">

<articleinfo>
	<title>First steps towards the next generation netfilter subsystem</title>
	<authorgroup>
		<author>
			<personname>
				<firstname>Harald</firstname>
				<surname>Welte</surname>
			</personname>
		<!--
			<personblurb>Harald Welte</personblurb>
				<affiliation>
					<orgname>netfilter core team</orgname>
					<address>
						<email>laforge@netfilter.org</email>
					</address>
				</affiliation>

			-->
			<email>laforge@netfilter.org</email>
		</author>
		</authorgroup>
	<copyright>
		<year>2005</year>
		<holder>Harald Welte &lt;laforge@netfilter.org&gt; </holder>
	</copyright>
	<date>Sep 21, 2005</date>
	<edition>1</edition>
	<!-- <orgname>netfilter core team</orgname> -->
	<releaseinfo>
		1.0
	</releaseinfo>

	<abstract>

<para>
Until 2.6, every new kernel version came with its own incarnation of a packet
filter: ipfw, ipfwadm, ipchains, iptables. 2.6.x still had iptables. What was
wrong? Or was iptables good enough to last even two generations?
</para>
<para>
In reality the netfilter project is working on gradually transforming the
existing framework into something new. Some of those changes are transparent to
the user, so they slip into a kernel release almost unnoticed. However, for
expert users and developers those changes are noteworthy anyway.
</para>
<para>
Some other changes just extend the existing framework, so most users again
won't even notice them - they just don't take advantage of those new features.
</para>
<para>
The 2.6.14 kernel release will mark a milestone, since it is scheduled to
contain nfnetlink, ctnetlink, nfnetlink_queue and nfnetlink_log - basically a
totally new netlink-based kernel/userspace interface for most parts of the
netfilter subsystem.
</para>
<para>
nf_conntrack, a generic layer-3 independent connection tracking subsystem,
initially supporting IPv4 and IPv6, is also in the queue of pending patches.
Chances are high that it will be included in the mainline kernel at the time
this paper is presented at Linux Kongress.
</para>
<para>
Another new subsystem within the framework is the "ipset" filter, basically an
alternative to using iptables in certain areas.
</para>
<para>
The presentation (but not this paper) will also summarize the results of the
annual netfilter development workshop, which is scheduled just the week before
Linux Kongress. 
</para>
	</abstract>

</articleinfo>

<section>
<title>nfnetlink</title>
<para>
In the current (pre-2.6.14) Linux kernel, there is no unified communications
infrastructure used by all parts of the netfilter/iptables subsystem.  Some
parameters can be read from /proc, some can be set via sysctl, and some only as
module load-time parameters.  The iptables configuration happens via
get/setsockopt, and the userspace queueing and logging use two separate (scarce)
netlink family numbers.
</para>
<para>
By contrast, most of the rest of the network stack is controlled via netlink.
Examples are routing tables, routing policy, interface configuration, traffic
control and IPsec.
</para>
<para>
nfnetlink is the answer for all netfilter-related kernel/userspace interaction.
It provides a thin layer on top of netlink.  The nfnetlink code in the kernel
has its userspace counterpart called "libnfnetlink".
</para>
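<para>
To give an impression of how thin this layer is, the following Python sketch
packs the header of a single nfnetlink request: a standard netlink header
followed by a small nfnetlink-specific header.  The field layout and the
numeric constants are meant to mirror the Linux netlink and nfnetlink headers,
but they are assumptions here and should be checked against the headers of the
kernel in use.  The function name build_nfnl_header is hypothetical and not
part of libnfnetlink.
</para>
<programlisting><![CDATA[
```python
import struct

# Assumed constants, mirroring linux/netlink.h and linux/netfilter/nfnetlink.h.
NLM_F_REQUEST = 0x01
NLM_F_DUMP = 0x300
NFNL_SUBSYS_CTNETLINK = 1   # subsystem id of ctnetlink
IPCTNL_MSG_CT_GET = 1       # "get conntrack entries" message
AF_INET = 2
NFNETLINK_V0 = 0


def build_nfnl_header(subsys, msg, flags, family, seq=1, pid=0, res_id=0):
    """Pack struct nlmsghdr followed by struct nfgenmsg (no attributes)."""
    # nfnetlink multiplexes its subsystems in the upper byte of nlmsg_type.
    nlmsg_type = (subsys << 8) | msg
    # struct nfgenmsg: family (u8), version (u8), res_id (big-endian u16).
    nfgenmsg = struct.pack("=BB", family, NFNETLINK_V0) + struct.pack("!H", res_id)
    nlmsg_len = 16 + len(nfgenmsg)
    # struct nlmsghdr: len (u32), type (u16), flags (u16), seq (u32), pid (u32),
    # all in host byte order.
    nlmsghdr = struct.pack("=IHHII", nlmsg_len, nlmsg_type, flags, seq, pid)
    return nlmsghdr + nfgenmsg


# Header of a "dump all IPv4 conntrack entries" request.
request = build_nfnl_header(NFNL_SUBSYS_CTNETLINK, IPCTNL_MSG_CT_GET,
                            NLM_F_REQUEST | NLM_F_DUMP, AF_INET)
```
]]></programlisting>
<para>
In practice an application would not build these bytes by hand but leave that
to libnfnetlink and the subsystem-specific libraries described below.
</para>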
</section>

<section>
<title>conntrack event API</title>
<para>
For some applications (such as state replication or flow-based accounting) it
is interesting to learn about conntrack state changes.
</para>
<para>
The new conntrack event API provides in-kernel notification of conntrack state
changes via a standard <structname>notifier_chain</structname>.
</para>
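<para>
The notifier chain pattern itself can be modelled in a few lines of Python.
This is only a conceptual sketch: the class NotifierChain and the event names
are hypothetical stand-ins, not the kernel API, and the conntrack entry is
reduced to a plain dictionary.
</para>
<programlisting><![CDATA[
```python
# Hypothetical event names; the kernel defines its own event flags.
EVENT_NEW = "new"
EVENT_DESTROY = "destroy"


class NotifierChain:
    """Minimal stand-in for a kernel notifier chain."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def unregister(self, callback):
        self._callbacks.remove(callback)

    def call_chain(self, event, conntrack):
        # Every registered listener sees every event, in registration order.
        for callback in self._callbacks:
            callback(event, conntrack)


conntrack_chain = NotifierChain()
seen = []
# A listener such as a state replication or accounting module.
conntrack_chain.register(lambda event, ct: seen.append((event, ct["id"])))
conntrack_chain.call_chain(EVENT_NEW, {"id": 1})
conntrack_chain.call_chain(EVENT_DESTROY, {"id": 1})
```
]]></programlisting>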
</section>

<section>
<title>nfnetlink_conntrack (aka ctnetlink)</title>
<para>
nfnetlink_conntrack is a nfnetlink-based interface for reading, dumping and
manipulating connection tracking state from userspace.
</para>
<para>
The most straightforward application is to obtain a list of currently tracked
connections.  In pre-2.6.14 kernels, this can only be done via the ugly
<filename>/proc/net/ip_conntrack</filename> virtual file.  The file-based
access is slow, unreliable, suboptimal and doesn't allow for efficient
searching.
</para>
<para>
However, certain monitoring applications or e.g. a NAT-aware identd
implementation have demand for efficient fine-grained access.
</para>
<para>
Also, the administrator might want to selectively delete connection tracking
entries, or even flush the whole table.  In pre-2.6.14 kernels, there is no interface for
that apart from the "rmmod ip_conntrack; modprobe ip_conntrack" kludge.
</para>
<para>
Additional (future) users of ctnetlink are connection tracking helpers in
userspace.  Imagine something like a hybrid between transparent proxying and
the current in-kernel helpers.  You get the benefit of running the helper as
userspace code that cannot crash your kernel, and still retain the benefit of
e.g. not having to process FTP data packets (but only control packets) in
userspace.
</section>

<section>
<title>libnfnetlink_conntrack</title>
<para>
libnfnetlink_conntrack is the userspace counterpart to nfnetlink_conntrack
inside the kernel.  It constructs and parses nfnetlink packets and thus
provides a "function and struct" style C API.
</para>
</section>

<section>
<title>The "conntrack" program</title>
<para>
The <command>conntrack</command> command is a userspace program linked against
libnfnetlink_conntrack.  It allows command-line access to the connection
tracking table.
</para>
<para>
<command>conntrack</command> supports listing, deleting, updating, flushing and
even creating connection tracking entries.  It also allows listing, deleting
and updating of conntrack expectations.
</para>
</section>

<section>
<title>nf_queue</title>
<para>
nf_queue is not really something new, but very few people have known about it
until now.  The 2.4.x netfilter subsystem first introduced a generic packet
queueing mechanism for asynchronously sending packets to userspace (and
reinjecting them or a verdict).  This mechanism is mostly known as ip_queue, or
the QUEUE target.
</para>
<para>
In reality, ip_queue sits on top of a small layer called nf_queue.  nf_queue
allows for one netfilter queue handler per network protocol family.  All
netfilter hooks within this protocol family that return the NF_QUEUE verdict
will send the packet to this nf_queue handler.
</para>
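<para>
The "one queue handler per protocol family" rule can be pictured with the
following toy model in Python.  It is a sketch under stated assumptions, not
kernel code: the class QueueCore and its methods are hypothetical, and both the
EBUSY return value for a second registration and the dropping of packets
without a handler are assumptions of this model.
</para>
<programlisting><![CDATA[
```python
import errno

PF_INET = 2    # protocol family number of IPv4
PF_INET6 = 10  # protocol family number of IPv6 on Linux


class QueueCore:
    """Toy model of nf_queue: at most one queue handler per protocol family."""

    def __init__(self):
        self._handlers = {}  # protocol family -> handler callable

    def register_handler(self, pf, handler):
        # A second handler for the same family is refused (assumed: -EBUSY).
        if pf in self._handlers:
            return -errno.EBUSY
        self._handlers[pf] = handler
        return 0

    def queue(self, pf, packet):
        """Called when a hook of family pf returned the NF_QUEUE verdict."""
        handler = self._handlers.get(pf)
        if handler is None:
            return False  # no handler: the packet is dropped in this model
        handler(packet)
        return True


core = QueueCore()
ip_queue = []  # stands in for the ip_queue handler's packet list
first = core.register_handler(PF_INET, ip_queue.append)
second = core.register_handler(PF_INET, print)
queued_v4 = core.queue(PF_INET, "ipv4 packet")
queued_v6 = core.queue(PF_INET6, "ipv6 packet")
```
]]></programlisting>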
<para>
In the existing 2.4.x and pre-2.6.14 code, the mainline kernel only had one
queue handler: ip_queue.  This basically means that only IP packets could be
queued for a userspace process.
</para>
<para>
Outside of the official kernel tree, a "copy+paste" port of ip_queue was made
to IPv6.  The netfilter/iptables project has had enough copy+paste style
"ports" due to architectural limitations.  Therefore the code was not accepted
into the mainline kernel.  Rather, work on a generic replacement was continued.
</para>
<para>
Which queue handler is to be used for which protocol family can now be configured
via nfnetlink_queue (see below).  The current status can also be read from
<filename>/proc/net/netfilter/nf_queue</filename>.
</para>
</section>

<section>
<title>nfnetlink_queue</title>
<para>
nfnetlink_queue is a nfnetlink-based and layer 3 protocol independent
replacement of ip_queue.
</para>
<para>
It provides all features of ip_queue for packets independent of their protocol.
</para>
<para>
In addition to merely replicating ip_queue functionality, it fixes the most
fundamental problem with the old ip_queue code: there was only one global
queue, and there could only be one userspace process attached to it.
</para>
<para>
nfnetlink_queue supports up to 65535 different dynamically-created queues.
Packets can be put into a specific queue by using the NFQUEUE target.  For
backwards compatibility, packets coming from the iptables QUEUE target will be
placed in queue number 0.
</para>
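<para>
The relationship between the two targets and the numbered queues can be
sketched as follows.  This is a hypothetical Python model, not the kernel
implementation: the class QueueTable and the two target functions are invented
for illustration, and the 16-bit range of queue numbers is an assumption of the
sketch.
</para>
<programlisting><![CDATA[
```python
from collections import deque


class QueueTable:
    """Toy model of nfnetlink_queue: queues are created when a process binds."""

    def __init__(self):
        self._queues = {}  # queue number -> deque of packets

    def bind(self, queue_num):
        # Assumed: queue numbers fit into 16 bits.
        if not 0 <= queue_num <= 0xFFFF:
            raise ValueError("queue number out of range")
        return self._queues.setdefault(queue_num, deque())

    def enqueue(self, packet, queue_num):
        queue = self._queues.get(queue_num)
        if queue is None:
            return False  # no process bound to this queue
        queue.append(packet)
        return True


def nfqueue_target(table, packet, queue_num):
    """NFQUEUE target: the rule selects the queue number."""
    return table.enqueue(packet, queue_num)


def queue_target(table, packet):
    """Legacy QUEUE target: always ends up in queue 0."""
    return table.enqueue(packet, 0)


table = QueueTable()
legacy_app = table.bind(0)    # e.g. an application still using libipq
new_app = table.bind(42)      # a second, independent process
queue_target(table, "packet A")
nfqueue_target(table, "packet B", 42)
unbound = nfqueue_target(table, "packet C", 7)
```
]]></programlisting>
<para>
The point of the sketch is that the two bound processes never see each other's
packets, which was impossible with the single global queue of ip_queue.
</para>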
<para>
Userspace processes can now also receive additional packet metadata such as the
PHYSINDEV/PHYSOUTDEV devices in case of bridging.
</para>
</section>

<section>
<title>libnfnetlink_queue</title>
<para>
The library libnfnetlink_queue is the userspace counterpart to nfnetlink_queue
inside the kernel.  It provides an easy-to-use C language interface to
userspace packet queueing.
</para>
<para>
For legacy applications using <filename>libipq</filename>, an API-compatible
(but not ABI-compatible) libipq replacement is available together with
libnfnetlink_queue.
</para>
</section>

<section>
<title>nf_log</title>
<para>
Traditionally, netfilter itself doesn't provide any packet logging
infrastructure.   Only iptables provides the LOG target (for klogd/syslogd
logging).  In 2001, the ULOG target was added to support more efficient logging
via a dedicated netlink socket.
</para>
<para>
When the TCP window tracking code was introduced, the need to log packets
(such as TCP out-of-window packets) from non-iptables code became pressing.
</para>
<para>
Instead of a more generic solution, it was decided to have module load time
parameters (nf_log) decide whether ipt_LOG or ipt_ULOG register as "internal
logging backend" that can be used by conntrack.
</para>
<para>
In 2.6.14, nf_log became a first-class citizen.  This means that the iptables
LOG target doesn't do any direct logging.  Instead it registers as a nf_log
backend with the core, and calls the nf_log frontend when it wishes to log a
packet.
</para>
<para>
The nf_log core can then decide whether to log the packet using the ipt_LOG
provided syslog backend, or via old style ipt_ULOG netlink logging, or the
newly-introduced nfnetlink_log mechanism (see below).
</para>
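<para>
The split into a logging frontend and several selectable backends can be
modelled as follows.  The Python class LogCore and its method names are
hypothetical and only illustrate the idea; they do not reproduce the kernel's
nf_log function signatures.
</para>
<programlisting><![CDATA[
```python
PF_INET = 2  # protocol family number of IPv4


class LogCore:
    """Toy model of the nf_log core: named backends, one active per family."""

    def __init__(self):
        self._backends = {}  # backend name -> logging callable
        self._active = {}    # protocol family -> backend name

    def register_backend(self, name, log_fn):
        self._backends[name] = log_fn

    def bind(self, pf, name):
        # Select which registered backend logs packets of this family.
        if name not in self._backends:
            raise KeyError(name)
        self._active[pf] = name

    def log_packet(self, pf, prefix, packet):
        """Frontend used by callers such as the LOG target or conntrack."""
        name = self._active.get(pf)
        if name is None:
            return None  # no backend bound: nothing is logged in this model
        self._backends[name](prefix, packet)
        return name


syslog_lines, netlink_msgs = [], []
core = LogCore()
core.register_backend("ipt_LOG", lambda prefix, pkt: syslog_lines.append(prefix + pkt))
core.register_backend("nfnetlink_log", lambda prefix, pkt: netlink_msgs.append((prefix, pkt)))

core.bind(PF_INET, "ipt_LOG")
core.log_packet(PF_INET, "out of window: ", "tcp packet")
core.bind(PF_INET, "nfnetlink_log")  # the caller does not change
core.log_packet(PF_INET, "out of window: ", "tcp packet")
```
]]></programlisting>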
<para>
Which log handler is to be used for what protocol family can be configured
via nfnetlink (see below).  The current status can also be read from
<filename>/proc/net/netfilter/nf_log</filename>.
</para>
</section>

<section>
<title>nfnetlink_log</title>
<para>
nfnetlink_log is to logging what nfnetlink_queue is to queueing.  It takes
the ideas of the ipt_ULOG target, reimplements them in a layer 3 protocol
independent fashion, and moves the transport on top of nfnetlink.
</para>
<para>
ipt_ULOG already allowed for up to 32 logging groups, which seemed to be enough
in all practical cases.  To be consistent with nfnetlink_queue,
nfnetlink_log now also supports 65535 logging groups, each of which can be
terminated by a different logging process.
</para>
</section>

<section>
<title>libnfnetlink_log</title>
<para>
Analogous to libnfnetlink_queue, libnfnetlink_log is the userspace counterpart
to nfnetlink_log in the kernel.
</para>
<para>
libnfnetlink_log also provides a libipulog backwards compatibility API.
</para>
</section>

<section>
<title>Flow based accounting</title>
<para>
The fundamental idea of flow-based (or more correctly: connection-based)
accounting is to keep per-connection byte and packet counters within the
connection tracking table.
</para>
<para>
On firewall systems that already use ip_conntrack, keeping those per-connection
counters only adds very little overhead to the existing connection tracking,
and is thus almost free.
</para>
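<para>
Why the counters are almost free can be seen in a small model: the lookup of
the connection has to happen anyway, so accounting only adds two increments
per packet.  The following Python sketch uses hypothetical names (Tuple5,
Conntrack, ConntrackTable) and deliberately ignores NAT, timeouts and protocol
state.
</para>
<programlisting><![CDATA[
```python
from collections import namedtuple

Tuple5 = namedtuple("Tuple5", "proto src sport dst dport")


def reply_of(t):
    """Tuple of the reply direction (no NAT in this sketch)."""
    return Tuple5(t.proto, t.dst, t.dport, t.src, t.sport)


class Conntrack:
    def __init__(self, original):
        self.original = original
        # [packets, bytes] for each direction of the connection.
        self.counters = {"original": [0, 0], "reply": [0, 0]}


class ConntrackTable:
    def __init__(self):
        self._by_tuple = {}

    def account(self, tuple5, length):
        # The lookup is what connection tracking has to do anyway.
        entry = self._by_tuple.get(tuple5)
        if entry is None:
            entry = Conntrack(tuple5)
            self._by_tuple[tuple5] = entry
            self._by_tuple[reply_of(tuple5)] = entry
        # Accounting itself is just two increments.
        direction = "original" if tuple5 == entry.original else "reply"
        entry.counters[direction][0] += 1
        entry.counters[direction][1] += length
        return entry


table = ConntrackTable()
request = Tuple5("tcp", "192.168.1.2", 1025, "192.168.1.1", 22)
table.account(request, 60)
table.account(reply_of(request), 1500)
connection = table.account(request, 40)
```
]]></programlisting>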
<para>
Internally, flow-based accounting uses both the conntrack event API and
nfnetlink_conntrack.
</para>
<para>
For a more detailed description of flow based accounting and the motivations
behind it, please refer to my paper on flow based accounting published in the
proceedings of Linuxtag 2005.
</para>
</section>

<section>
<title>nf_conntrack</title>
<para>
nf_conntrack is a generalized version of ip_conntrack.  This generalization is
required to provide connection tracking for non-IPv4 protocols.  Currently only
IPv4 and IPv6 are supported in nf_conntrack.
</para>
<para>
The architecture of nf_conntrack is almost exactly the same as that of
ip_conntrack.
</para>
<para>
nf_conntrack is not in the 2.6.14 kernel series but will very likely be merged
during the early 2.6.15 development process.  The latest nf_conntrack version can be obtained from the netfilter-2.6 git tree.
</para>
</section>

</article>