\documentclass{article}
\usepackage{german}
\usepackage{fancyheadings}
\usepackage{a4}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{0.0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9.5in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.05in}


\begin{document}
\title{Linux 2.4.x netfilter/iptables firewalling internals}

\author{Harald Welte\\
        laforge@gnumonks.org\\
	\copyright{}2002 H. Welte}

\date{25. April 2002}

\maketitle

\setcounter{section}{0}
\setcounter{subsection}{0}
\setcounter{subsubsection}{0}

\section{Introduction}
The Linux 2.4.x kernel series has introduced a totally new kernel firewalling
subsystem.  It is much more than a plain successor of ipfwadm or ipchains.

The netfilter/iptables project has a very modular design, and its
subprojects can be split into several parts: netfilter, iptables, connection
tracking, NAT and packet mangling.

While most users will already have learned how to use the basic functions
of netfilter/iptables in order to convert their old ipchains firewalls to
iptables, there's more advanced but less used functionality in
netfilter/iptables.

The presentation covers the design principles behind the netfilter/iptables
implementation.  This knowledge enables us to understand how the individual
parts of netfilter/iptables fit together, and for which potential applications
this is useful.

\section{Internal netfilter/iptables architecture}

\subsection{Netfilter hooks in protocol stacks}

One of the major motivations behind the redesign of the Linux packet
filtering and NAT system during the 2.3.x kernel series was the widespread
firewall-specific code within the core IPv4 stack.  Ideally the core
IPv4 stack (as used by regular hosts and routers) shouldn't contain any
firewalling-specific code, resulting in no unwanted interaction and less
code complexity.  This desire led to the invention of {\it netfilter}.

\subsubsection{Architecture of netfilter}

Netfilter is basically a system of callback functions within the network
stack.  It provides a non-portable API towards in-kernel networking
extensions.
  
What we call {\it netfilter hook} is a well-defined call-out point within a
layer three protocol stack, such as IPv4, IPv6 or DECnet.  Any layer three
network stack can define an arbitrary number of hooks, usually placed at
strategic points within the packet flow.

Any other kernel code can subsequently register callback functions for
any of these hooks.  As in most systems there will be more than one callback
function registered for a particular hook, a {\it priority} is specified upon
registration of the callback function.  This priority defines the order in
which the individual callback functions at a particular hook are called.

The return value of any registered callback function can be one of the
following (a registration sketch follows the list):
\begin{itemize}
\item
{\bf NF\_ACCEPT}: continue traversal as usual
\item
{\bf NF\_DROP}: drop the packet; do not continue traversal
\item
{\bf NF\_STOLEN}: callback function has taken over the packet; do not continue
\item
{\bf NF\_QUEUE}: enqueue the packet to userspace
\item
{\bf NF\_REPEAT}: call this hook again
\end{itemize}
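
As a minimal sketch of how such a registration looks (using the 2.4.x API
from {\tt linux/netfilter.h}; the callback body and all names are
placeholders, not actual kernel code):

\begin{verbatim}
#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

/* callback: called for every packet traversing the chosen hook */
static unsigned int my_hook(unsigned int hooknum,
                            struct sk_buff **pskb,
                            const struct net_device *in,
                            const struct net_device *out,
                            int (*okfn)(struct sk_buff *))
{
        /* inspect (*pskb) here, then tell netfilter what to do */
        return NF_ACCEPT;
}

static struct nf_hook_ops my_ops = {
        { NULL, NULL },         /* list head, used by netfilter */
        my_hook,                /* the callback function */
        PF_INET,                /* protocol family: IPv4 */
        NF_IP_PRE_ROUTING,      /* which hook to attach to */
        NF_IP_PRI_FILTER        /* priority among callbacks at this hook */
};

static int __init my_init(void)
{
        return nf_register_hook(&my_ops);
}

static void __exit my_fini(void)
{
        nf_unregister_hook(&my_ops);
}

module_init(my_init);
module_exit(my_fini);
\end{verbatim}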

\subsubsection{Netfilter hooks within IPv4}

The IPv4 stack provides five netfilter hooks, which are placed at the
following points within the packet flow:

\begin{verbatim}
   --->[1]--->[ROUTE]--->[3]--->[4]--->
                 |            ^
                 |            |
                 |         [ROUTE]
                 v            |
                [2]          [5]
                 |            ^
                 |            |
                 v            |

                local processes
\end{verbatim}

Packets received on any network interface arrive at the left side of the
diagram.  After the verification of the IP header checksum, the
NF\_IP\_PRE\_ROUTING [1] hook is traversed.  

If the packet ``survives'' (i.e. NF\_ACCEPT is returned), it enters the
routing code.  Where we continue from here depends on the destination of the
packet.

Packets with a local destination (i.e. packets where the destination address is
one of the host's own IP addresses) traverse the NF\_IP\_LOCAL\_IN [2]
hook.  If all callback functions return NF\_ACCEPT, the packet is finally passed
to the socket code, which eventually passes the packet to a local process.

Packets with a remote destination (i.e. packets which are forwarded by the
local machine) traverse the NF\_IP\_FORWARD [3] hook.  If they ``survive'',
they finally pass the NF\_IP\_POST\_ROUTING [4] hook and are sent off the
outgoing network interface.

Locally generated packets first traverse the NF\_IP\_LOCAL\_OUT [5] hook, then
enter the routing code, and finally go through the NF\_IP\_POST\_ROUTING [4]
hook before being sent off the outgoing network interface.

\subsubsection{Netfilter hooks within IPv6}

As the IPv4 and IPv6 protocols are very similar, the netfilter hooks within the
IPv6 stack are placed at exactly the same locations as in the IPv4 stack.  The
only change are the hook names: NF\_IP6\_PRE\_ROUTING, NF\_IP6\_LOCAL\_IN,
NF\_IP6\_FORWARD, NF\_IP6\_POST\_ROUTING, NF\_IP6\_LOCAL\_OUT.

\subsubsection{Netfilter hooks within DECnet}

There are seven DECnet hooks.  The first five hooks (NF\_DN\_PRE\_ROUTING,
NF\_DN\_LOCAL\_IN, NF\_DN\_FORWARD, NF\_DN\_LOCAL\_OUT, NF\_DN\_POST\_ROUTING)
are pretty much the same as in IPv4.  The last two hooks (NF\_DN\_HELLO,
NF\_DN\_ROUTE) are used in conjunction with DECnet Hello and Routing packets.

\subsubsection{Netfilter hooks within ARP}

Recent kernels\footnote{IIRC, starting with 2.4.19-pre3} have added support for netfilter hooks within the ARP code.
There are two hooks: NF\_ARP\_IN and NF\_ARP\_OUT, for incoming and outgoing
ARP packets respectively.

\subsubsection{Netfilter hooks within IPX}

There have been experimental patches to add netfilter hooks to the IPX code,
but they never got integrated into the kernel source.

\subsection{Packet selection using IP Tables}

The IP tables core (ip\_tables.o) provides a generic layer for the evaluation
of rulesets.

An IP table consists of an arbitrary number of {\it chains}, which in turn
consist of a linear list of {\it rules}, which again consist of any
number of {\it matches} and one {\it target}.

{\it Chains} can further be divided into two classes: {\it builtin
chains} and {\it user-defined chains}.  Builtin chains are always present; they
are created upon table registration.  They are also the entry points for table
iteration.  User-defined chains are created at runtime upon user interaction.

{\it Matches} specify the matching criteria; there can be zero or more matches
per rule.

{\it Targets} specify the action which is to be executed in case {\bf all}
matches match.  There can only be a single target per rule.

Matches and targets can either be {\it builtin} or {\it Linux kernel modules}.

There are two special targets:
\begin{itemize}
\item
By using a chain name as target, it is possible to jump to the respective chain
in case the matches match.
\item
By using the RETURN target, it is possible to return to the previous (calling)
chain, as shown in the example after this list.
\end{itemize}
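
To make this concrete, the following command-line fragment creates a
user-defined chain, jumps to it from a builtin chain, and returns (chain
name and addresses are made up for illustration):

\begin{verbatim}
# create a user-defined chain and jump to it from the builtin FORWARD chain
iptables -N webfilter
iptables -A FORWARD -p tcp --dport 80 -j webfilter

# a rule with one match (source address) and the ACCEPT target
iptables -A webfilter -s 10.0.0.0/8 -j ACCEPT

# return to the calling chain for everything else
iptables -A webfilter -j RETURN
\end{verbatim}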

The IP tables core handles the following functions:
\begin{itemize}
\item
Registering and unregistering tables
\item
Registering and unregistering matches and targets (which can be implemented as Linux kernel modules)
\item
Kernel / userspace interface for manipulation of IP tables
\item
Traversal of IP tables
\end{itemize}

\subsubsection{Packet filtering using the ``filter'' table}

Traditional packet filtering (i.e. the successor to ipfwadm/ipchains) takes
place in the ``filter'' table.  Packet filtering works like a sieve: A packet
is (in the end) either dropped or accepted - but never modified.

The ``filter'' table is implemented in the {\it iptable\_filter.o} module
and contains three builtin chains:

\begin{itemize}
\item
{\bf INPUT} attaches to NF\_IP\_LOCAL\_IN
\item
{\bf FORWARD} attaches to NF\_IP\_FORWARD
\item
{\bf OUTPUT} attaches to NF\_IP\_LOCAL\_OUT
\end{itemize}

The placement of the chains / hooks is done in such a way that every
conceivable packet always traverses only one of the builtin chains.  Packets
destined for the local host traverse only INPUT, forwarded packets only
FORWARD, and locally-originated packets only OUTPUT.

\subsubsection{Packet mangling using the ``mangle'' table} 

As stated above, operations which would modify a packet do not belong in the
``filter'' table.   The ``mangle'' table is available for all kinds of packet
manipulation - but not manipulation of addresses (which is NAT).

The mangle table attaches to all five netfilter hooks and provides the
respective builtin chains (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING)
\footnote{This has changed during the 2.4.x kernel series; old kernels may
only support three chains (PREROUTING, POSTROUTING, OUTPUT).}.
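
As an illustrative example (the TOS target is one of several mangle
targets; the port and TOS value are arbitrary choices), interactive traffic
could be given low-delay treatment like this:

\begin{verbatim}
# set "Minimize-Delay" TOS on forwarded ssh traffic
iptables -t mangle -A PREROUTING -p tcp --dport 22 \
         -j TOS --set-tos Minimize-Delay
\end{verbatim}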

\subsection{Connection Tracking Subsystem}

Traditional packet filters can only match on criteria within the
currently processed packet, like source/destination IP address, port numbers,
TCP flags, etc.  As most applications have a notion of connections, or at least
a request/response style protocol, there is a lot of information which cannot
be derived from looking at a single packet.

Thus, modern (stateful) packet filters attempt to track connections (flows)
and their respective protocol states for all traffic through the packet
filter.

Connection tracking within Linux is implemented as a netfilter module, called
ip\_conntrack.o.

Before describing the connection tracking subsystem, we need to introduce a
couple of definitions and primitives used throughout the conntrack code.

A connection is represented within the conntrack subsystem using {\it struct
ip\_conntrack}, also called {\it connection tracking entry}.

Connection tracking utilizes {\it conntrack tuples}, which are tuples
consisting of (srcip, srcport, dstip, dstport, l4proto).  A connection is
uniquely identified by two tuples:  the tuple in the original direction
(IP\_CT\_DIR\_ORIGINAL) and the tuple for the reply direction
(IP\_CT\_DIR\_REPLY).
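
A simplified sketch of what such a tuple holds is shown below.  The real
{\it struct ip\_conntrack\_tuple} in the kernel headers uses nested unions
to stay protocol independent, so this is illustrative only:

\begin{verbatim}
/* illustrative sketch, NOT the literal kernel definition */
struct conntrack_tuple_sketch {
        u_int32_t src_ip;
        u_int16_t src_port;   /* or ICMP id, depending on l4 protocol */
        u_int32_t dst_ip;
        u_int16_t dst_port;   /* or ICMP type/code */
        u_int8_t  l4proto;    /* IPPROTO_TCP, IPPROTO_UDP, ... */
};
\end{verbatim}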

Connection tracking itself does not drop packets\footnote{well, in some rare
cases in combination with NAT it needs to drop. But don't tell anyone, this is
secret.} or impose any policy.  It just associates every packet with a
connection tracking entry, which in turn has a particular state.  All other
kernel code can use this state information\footnote{state information is
internally represented via the {\it struct sk\_buff.nfct} structure member of a
packet.}.

\subsubsection{Integration of conntrack with netfilter} 

If the ip\_conntrack.o module is registered with netfilter, it attaches to the
NF\_IP\_PRE\_ROUTING, NF\_IP\_POST\_ROUTING, NF\_IP\_LOCAL\_IN and
NF\_IP\_LOCAL\_OUT hooks.

Because forwarded packets are the most common case on firewalls, I will only
describe how connection tracking works for forwarded packets.  The two relevant
hooks for forwarded packets are NF\_IP\_PRE\_ROUTING and NF\_IP\_POST\_ROUTING.

Every time a packet arrives at the NF\_IP\_PRE\_ROUTING hook, connection
tracking creates a conntrack tuple from the packet.  It then compares this
tuple to the original and reply tuples of all already-seen connections
\footnote{Of course this is not implemented as a linear search over all existing connections.} to find out if this just-arrived packet belongs to any existing
connection.  If there is no match, a new conntrack table entry (struct
ip\_conntrack) is created.

Let's assume the case where we don't have any existing connections and are
starting from scratch.

The first packet comes in, we derive the tuple from the packet headers, look up
the conntrack hash table, and don't find any matching entry.  As a result, we
create a new struct ip\_conntrack.  This struct ip\_conntrack is filled with
all necessary data, like the original and reply tuple of the connection.
How do we know the reply tuple?  By inverting the source and destination
parts of the original tuple.\footnote{So why do we need two tuples, if they can
be derived from each other?  Wait until we discuss NAT.}
Please note that this new struct ip\_conntrack is {\bf not} yet placed
into the conntrack hash table.

The packet is now passed on to other callback functions which have registered
with a lower priority at NF\_IP\_PRE\_ROUTING.  It then continues traversal of
the network stack as usual, including all respective netfilter hooks.

If the packet survives (i.e. is not dropped by the routing code, network stack,
firewall ruleset, ...), it re-appears at NF\_IP\_POST\_ROUTING.  In this case,
we can now safely assume that this packet will be sent off on the outgoing
interface, and thus put the connection tracking entry which we created at
NF\_IP\_PRE\_ROUTING into the conntrack hash table.  This process is called
{\it confirming the conntrack}.
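
The following pseudo-code summarizes these two steps for a forwarded packet;
all function names are illustrative, not actual kernel symbols:

\begin{verbatim}
/* at NF_IP_PRE_ROUTING (conntrack runs with highest priority): */
tuple = tuple_from_packet(skb);
ct = hash_lookup(tuple);            /* checks original + reply tuples */
if (ct == NULL)
        ct = new_conntrack(tuple);  /* unconfirmed, not yet in hash */
skb->nfct = ct;                     /* associate packet with entry */

/* at NF_IP_POST_ROUTING (packet survived routing and filtering): */
if (!is_confirmed(skb->nfct))
        hash_insert(skb->nfct);     /* "confirming the conntrack" */
\end{verbatim}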

The connection tracking code itself is not monolithic, but consists of a
couple of separate modules\footnote{They don't actually have to be separate
kernel modules; e.g. the TCP, UDP and ICMP tracking modules are all part of
the Linux kernel module ip\_conntrack.o}.  Besides the conntrack core, there
are two important kinds of modules: protocol helpers and application helpers.

Protocol helpers implement the layer-4-protocol specific parts.  They
currently exist for TCP, UDP and ICMP (an experimental GRE helper also
exists).

\subsubsection{TCP connection tracking}

As TCP is a connection-oriented protocol, it is not very difficult to imagine
how connection tracking for this protocol could work.  There are well-defined
state transitions possible, and conntrack can decide which state transitions
are valid within the TCP specification.  In reality it's not all that easy,
since we cannot assume that all packets that pass the packet filter actually
arrive at the receiving end, ...

It is noteworthy that the standard connection tracking code does {\bf not}
do TCP sequence number and window tracking.  A well-maintained patch to add
this feature has existed almost as long as connection tracking itself.  It will
be integrated in the 2.5.x kernel series.  The problem with window tracking is
its bad interaction with connection pickup.  The TCP conntrack code is able to
pick up already existing connections, e.g. in case your firewall was rebooted.
However, connection pickup conflicts with TCP window tracking:  the TCP
window scaling option is only transferred at connection setup time, and we
don't know about it in the case of pickup...

\subsubsection{ICMP tracking}

ICMP is not really a connection oriented protocol.  So how is it possible to
do connection tracking for ICMP?

The ICMP protocol can be split into two groups of messages:

\begin{itemize}
\item
ICMP error messages, which sort-of belong to a different connection: they
are associated as {\it RELATED} to that connection
(ICMP\_DEST\_UNREACH, ICMP\_SOURCE\_QUENCH, ICMP\_TIME\_EXCEEDED,
ICMP\_PARAMETERPROB, ICMP\_REDIRECT).
\item
ICMP queries, which have a request/reply character.  The conntrack code
gives the request a state of {\it NEW} and the reply {\it ESTABLISHED};
the reply closes the connection immediately
(ICMP\_ECHO, ICMP\_TIMESTAMP, ICMP\_INFO\_REQUEST, ICMP\_ADDRESS).
\end{itemize}

\subsubsection{UDP connection tracking}

UDP is designed as a connectionless datagram protocol.  But most common
protocols using UDP as their layer 4 protocol have bi-directional UDP
communication.  Imagine a DNS query, where the client sends a UDP frame to
port 53 of the nameserver, and the nameserver sends back a DNS reply packet
from its UDP port 53 to the client.

Netfilter treats this as a connection.  The first packet (the DNS request) is
assigned a state of {\it NEW}, because the packet is expected to create a new
'connection'.  The DNS server's reply packet is marked as {\it ESTABLISHED}.
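
Using the {\it state} match (described below), such a DNS ``connection''
can be filtered like this (a ruleset fragment for illustration):

\begin{verbatim}
# let local DNS queries out and only matching replies back in
iptables -A OUTPUT -p udp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT  -p udp --sport 53 -m state --state ESTABLISHED -j ACCEPT
\end{verbatim}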

\subsubsection{conntrack application helpers}

More complex application protocols involving multiple connections need special
support by a so-called ``conntrack application helper module''.  The stock
kernel ships modules for FTP and IRC (DCC).  Netfilter CVS currently contains
patches for PPTP, H.323, Eggdrop botnet, tftp and talk.  We're still lacking
a lot of protocols (e.g. SIP, SMB/CIFS) - but they are unlikely to appear
until somebody really needs them and either develops them on his own or
funds development.
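
For example, loading the FTP helper makes FTP data connections show up as
{\it RELATED} to their control connection (ruleset fragment for
illustration):

\begin{verbatim}
# load the FTP application helper
modprobe ip_conntrack_ftp
# accept the data connections it associates with tracked FTP sessions
iptables -A FORWARD -m state --state RELATED -j ACCEPT
\end{verbatim}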

\subsubsection{Integration of connection tracking with iptables}

As stated earlier, conntrack doesn't impose any policy on packets.  It just
determines the relation of a packet to already existing connections.  To base
packet filtering decisions on this state information, the iptables {\it state}
match can be used (a ruleset example follows the list below).  Every packet
falls into one of the following categories:

\begin{itemize}
\item
{\bf NEW}: packet would create a new connection, if it survives
\item
{\bf ESTABLISHED}: packet is part of an already established connection 
(either direction)
\item
{\bf RELATED}: packet is in some way related to an already established connection, e.g. ICMP errors or FTP data sessions
\item
{\bf INVALID}: conntrack is unable to derive conntrack information from this packet.  Please note that all multicast or broadcast packets fall into this category.
\end{itemize}
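
A typical stateful ruleset built on these categories could look like this
(using eth1 as the example inside interface):

\begin{verbatim}
# allow packets belonging to, or related to, existing connections
iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
# allow new connections only if initiated from the inside (eth1)
iptables -A FORWARD -m state --state NEW -i eth1 -j ACCEPT
# drop everything else, including INVALID packets
iptables -A FORWARD -j DROP
\end{verbatim}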

\subsection{NAT Subsystem}

The NAT (Network Address Translation) subsystem is probably the worst
documented subsystem within the whole framework.  There are two reasons for
this:  NAT is nasty and complicated, and the Linux 2.4.x NAT implementation
is easy to use, so nobody needs to know the nasty details.

Nonetheless, as I have traditionally concentrated mostly on the conntrack and
NAT subsystems, I will give a short overview.

NAT uses almost all of the previously described subsystems:
\begin{itemize}
\item
IP tables to specify which packets to NAT in which particular way. NAT
registers a ``nat'' table with PREROUTING, POSTROUTING and OUTPUT chains. 
\item
Connection tracking to associate NAT state with the connection.
\item
Netfilter to do the actual packet manipulation transparently to the rest of
the kernel.  NAT registers with NF\_IP\_PRE\_ROUTING, NF\_IP\_POST\_ROUTING,
NF\_IP\_LOCAL\_IN and NF\_IP\_LOCAL\_OUT.
\end{itemize}

The NAT implementation supports all kinds of NAT: source NAT,
destination NAT, NAT to address/port ranges, 1:1 NAT, ...

This fundamental design principle is still frequently misunderstood:\\
The information about which NAT mappings apply to a certain connection
is only gathered once - with the first packet of every connection.

So let's start to look at the life of a poor to-be-nat'ed packet.
For ease of understanding, I have chosen to describe the most frequently
used NAT scenario:  Source NAT of a forwarded packet.  Let's assume the
packet has an original source address of 1.1.1.1, an original destination
address of 2.2.2.2, and is going to be SNAT'ed to 9.9.9.9.  Let's further
ignore the fact that there are port numbers.

Once upon a time, our poor packet arrives at NF\_IP\_PRE\_ROUTING, where
conntrack has registered with highest priority.  This means that a conntrack
entry with the following two tuples is created:
\begin{verbatim}
IP_CT_DIR_ORIGINAL: 1.1.1.1 -> 2.2.2.2
IP_CT_DIR_REPLY: 2.2.2.2 -> 1.1.1.1
\end{verbatim}
After conntrack, the packet traverses the PREROUTING chain of the ``nat''
IP table.  Since only destination NAT happens at PREROUTING, no action
occurs.  After its lengthy way through the rest of the network stack,
the packet arrives at the NF\_IP\_POST\_ROUTING hook, where it traverses
the POSTROUTING chain of the ``nat'' table.  Here it hits an SNAT rule,
causing the following actions:
\begin{itemize}
\item
Fill in a {\it struct ip\_nat\_manip}, indicating the new source address
and the type of NAT (source NAT at POSTROUTING).  This struct is part of the
conntrack entry.
\item
Automatically derive the inverse NAT transformation for the reply packets:
destination NAT at PREROUTING.  Fill in another {\it struct ip\_nat\_manip}.
\item
Alter the REPLY tuple of the conntrack entry to
\begin{verbatim}
IP_CT_DIR_REPLY: 2.2.2.2 -> 9.9.9.9
\end{verbatim}
\item
Apply the SNAT transformation to the packet
\end{itemize}

Every other packet within this connection, independent of its direction,
will only execute the last step.  Since all NAT information is attached to
the conntrack entry, there is no need to do anything but apply
the same transformations to all packets within the same connection.
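
On the command line, the SNAT rule from this example would look like the
following (the output interface name is an example):

\begin{verbatim}
iptables -t nat -A POSTROUTING -s 1.1.1.1 -o eth0 \
         -j SNAT --to-source 9.9.9.9
\end{verbatim}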

\subsection{IPv6 Firewalling with ip6tables}

Yes, Linux 2.4.x comes with a usable, though incomplete system to secure
your IPv6 network.

The parts ported to IPv6 are:
\begin{itemize}
\item
IP tables (called IP6 tables)
\item
The ``filter'' table
\item
The ``mangle'' table
\item
The userspace library (libip6tc)
\item
The command line tool (ip6tables)
\end{itemize}

Due to the lack of conntrack and NAT\footnote{for god's sake we don't have NAT
with IPv6}, only traditional, stateless packet filtering is possible (a
minimal example follows the list below).  Apart from the obvious
matches/targets, ip6tables can match on:
\begin{itemize}
\item
{\it EUI64 checker}: verifies that the MAC address of the sender matches the EUI64-derived 64 least significant bits of the source IPv6 address
\item
{\it frag6 match}, matches on IPv6 fragmentation header
\item
{\it route6 match}, matches on IPv6 routing header
\item
{\it ahesp6 match}, matches on SPIs within AH or ESP headers of IPv6 packets
\end{itemize}
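
Basic stateless filtering with ip6tables then looks just like iptables
(a minimal example):

\begin{verbatim}
# accept inbound ssh, drop everything else by policy
ip6tables -A INPUT -p tcp --dport 22 -j ACCEPT
ip6tables -P INPUT DROP
\end{verbatim}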

However, the ip6tables code doesn't seem to be used very widely (yet?).
So please expect some potential remaining issues, since it is not as heavily
tested as iptables.

\subsection{Recent Development}

Please refer to the spoken word at the presentation.  Development at the 
time this paper was written can be quite different from development at the
time the presentation is held.

\section{Thanks}

I'd like to thank
\begin{itemize}
\item
{\it Linus Torvalds} for starting this interesting UNIX-like kernel
\item
{\it Alan Cox, David Miller, Alexey Kuznetsov, Andi Kleen} for building 
(one of?) the world's best TCP/IP stacks.
\item
{\it Paul ``Rusty'' Russell} for starting the netfilter/iptables project
\item
{\it The Netfilter Core Team} for continuing the netfilter/iptables effort
\item
{\it Astaro AG} for partially funding my current netfilter/iptables work
\item
{\it Conectiva Inc.} for partially funding parts of my past netfilter/iptables
work and for inviting me to live in Brazil
\item
{\it samba.org and Kommunikationsnetz Franken e.V.} for hosting the netfilter
homepage, CVS, mailing lists, ...
\end{itemize}

\end{document}