\documentclass[twocolumn,12pt]{article}

\usepackage{alltt}

\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{isolatin1}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{times}
\usepackage{url}
\usepackage[T1,obeyspaces]{zrl}

% "verbatim" with line breaks, obeying spaces
\providecommand\code{\begingroup \xrlstyle{tt}\Xrl}
% as above, but okay to break lines at spaces
\providecommand\brcode{\begingroup \zrlstyle{tt}\Zrl}

% Same as the pair above, but 'l' for long == small type
\providecommand\lcode{\begingroup \small\xrlstyle{tt}\Xrl}
\providecommand\lbrcode{\begingroup \small\zrlstyle{tt}\Zrl}

% For identifiers - "verbatim" with line breaks at punctuation
\providecommand\ident{\begingroup \urlstyle{tt}\Url}
\providecommand\lident{\begingroup \small\urlstyle{tt}\Url}




\begin{document}

% Required: do not print the date.
\date{}

\title{\texttt{ct\_sync}: state replication of \texttt{ip\_conntrack}\\
% {\normalsize Subtitle goes here}
}

\author{
Harald Welte \\
{\em netfilter core team / Astaro AG / hmw-consulting.de}\\
{\tt\normalsize laforge@gnumonks.org}\\
% \and
% Second Author\\
% {\em Second Institution}\\
% {\tt\normalsize another@address.for.email.com}\\
} % end author section

\maketitle

% Required: do not use page numbers on title page.
\thispagestyle{empty}

\section*{Abstract}

With traditional, stateless firewalling (such as ipfwadm, ipchains)
there is no need for special HA support in the firewalling
subsystem. As long as all packet filtering rules and routing table
entries are configured in exactly the same way, one can use any
available tool for IP address takeover to accomplish the goal of
failing over from one node to the other.

With Linux 2.4/2.6 netfilter/iptables, the Linux firewalling code
moves beyond traditional packet filtering. Netfilter provides a
modular connection tracking subsystem which can be employed for
stateful firewalling. The connection tracking subsystem gathers
information about the state of all current network flows
(connections). Packet filtering decisions and NAT information are
associated with this state information.

In a high availability scenario, this connection tracking state needs
to be replicated from the currently active firewall node to all
standby slave firewall nodes. Only when all connection tracking state
is replicated will the slave node have all the necessary state
information at the time a failover event occurs.

Due to funding by Astaro AG, the netfilter/iptables project now offers
a \ident{ct_sync} kernel module for replicating connection tracking state
across multiple nodes. The presentation will cover the architectural
design and implementation of the connection tracking failover system.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%           BODY OF PAPER GOES HERE                      %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Failover of stateless firewalls}

No special precautions are needed when installing a highly available
stateless packet filter.  Since no state is kept, all the information
needed for filtering is the ruleset and the individual, separate packets.

Building a set of highly available stateless packet filters can thus be
achieved by using any traditional means of IP-address takeover, such 
as Heartbeat or VRRPd.

The only remaining issue is to make sure the firewalling ruleset is
exactly the same on both machines.  This should be ensured by the firewall
administrator every time he updates the ruleset, and can optionally be
automated by scripts using scp or rsync.

If this is not applicable because a very dynamic ruleset is employed, one can
build a very simple solution using the iptables-supplied tools iptables-save
and iptables-restore:  The output of iptables-save can be piped over ssh to
iptables-restore on a different host.
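For example (the remote host name here is purely illustrative):
\brcode{iptables-save | ssh fw-backup iptables-restore}.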

This approach has the following limitations:
\begin{itemize}
\item
no state tracking
\item
not possible in combination with iptables stateful NAT
\item
no counter consistency of per-rule packet/byte counters
\end{itemize}

\section{Failover of stateful firewalls}

Modern firewalls implement state tracking (a.k.a.\ connection tracking) in order
to keep some state about the currently active sessions.  The amount of
per-connection state kept at the firewall depends on the particular
configuration and networking protocols used.

As soon as \texttt{any} state is kept at the packet filter, this state
information needs to be replicated to the slave/backup nodes within the
failover setup.

Since Linux 2.4.x, all relevant state is kept within the \textit{connection
tracking subsystem}.  In order to understand how this state could possibly be
replicated, we need to understand the architecture of this conntrack subsystem.

\subsection{Architecture of the Linux Connection Tracking Subsystem}

Connection tracking within Linux is implemented as a netfilter module, called
\ident{ip_conntrack.o} (\ident{ip_conntrack.ko} in 2.6.x kernels).  

Before describing the connection tracking subsystem, we need to introduce a
couple of definitions and primitives used throughout the conntrack code.

A connection is represented within the conntrack subsystem using 
\brcode{struct ip_conntrack}, also called \textit{connection tracking entry}.

Connection tracking makes use of \textit{conntrack tuples}, which consist of:
\begin{itemize}
\item
	source IP address
\item
	source port (or icmp type/code, gre key, ...)
\item
	destination IP address
\item
	destination port
\item
	layer 4 protocol number
\end{itemize}
	
A connection is uniquely identified by two tuples:  The tuple in the original
direction (\lident{IP_CT_DIR_ORIGINAL}) and the tuple for the reply direction
(\lident{IP_CT_DIR_REPLY}).
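
As a rough illustration (a simplified sketch, not the kernel's actual
\ident{ip_conntrack_tuple} definition), such a tuple could be modelled as:
{\footnotesize
\begin{verbatim}
/* Simplified sketch of a conntrack tuple;
 * the real struct ip_conntrack_tuple in
 * the kernel is laid out differently.   */
#include <stdint.h>

struct example_ct_tuple {
    uint32_t src_ip;    /* source IP      */
    uint32_t dst_ip;    /* destination IP */
    union {
        struct {
            uint16_t sport, dport;
        } tcp_udp;      /* TCP/UDP ports  */
        struct {
            uint8_t type, code;
        } icmp;         /* ICMP type/code */
    } l4;
    uint8_t protonum;   /* L4 protocol    */
};
\end{verbatim}
}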

Connection tracking itself does not drop packets\footnote{well, in some rare
cases in combination with NAT it needs to drop. But don't tell anyone, this is
secret.} or impose any policy.  It just associates every packet with a
connection tracking entry, which in turn has a particular state.  All other
kernel code can use this state information\footnote{State information is
referenced via the \brcode{struct sk_buff.nfct} structure member of a
packet.}.

\subsubsection{Integration of conntrack with netfilter} 

When the \ident{ip_conntrack.[k]o} module is registered with netfilter, it
attaches to the \lident{NF_IP_PRE_ROUTING}, \lident{NF_IP_POST_ROUTING}, \lident{NF_IP_LOCAL_IN},
and \lident{NF_IP_LOCAL_OUT} hooks.

Because forwarded packets are the most common case on firewalls, I will only
describe how connection tracking works for forwarded packets.  The two relevant
hooks for forwarded packets are \lident{NF_IP_PRE_ROUTING} and \lident{NF_IP_POST_ROUTING}.

Every time a packet arrives at the \lident{NF_IP_PRE_ROUTING} hook, connection
tracking creates a conntrack tuple from the packet.  It then compares this
tuple to the original and reply tuples of all already-seen 
connections
\footnote{Of course this is not implemented as a linear
search over all existing connections.} to find out if this
just-arrived packet belongs to any existing 
connection.  If there is no match, a new conntrack table entry 
(\brcode{struct ip_conntrack}) is created.

Let's assume the case where there are no already-existing connections,
i.e.\ we are starting from scratch.

The first packet comes in, we derive the tuple from the packet headers, look up
the conntrack hash table, and don't find any matching entry.  As a result, we
create a new \brcode{struct ip_conntrack}.  This \brcode{struct ip_conntrack} is filled with
all necessary data, like the original and reply tuple of the connection.
How do we know the reply tuple?  By inverting the source and destination
parts of the original tuple.\footnote{So why do we need two tuples, if they can
be derived from each other?  Wait until we discuss NAT.}
Please note that this new \brcode{struct ip_conntrack} is \textbf{not} yet placed
into the conntrack hash table.

The packet is now passed on to other callback functions which have registered
with a lower priority at \lident{NF_IP_PRE_ROUTING}.  It then continues traversal of
the network stack as usual, including all respective netfilter hooks.

If the packet survives (i.e., is not dropped by the routing code, network stack,
firewall ruleset, \ldots), it re-appears at \lident{NF_IP_POST_ROUTING}.  In this case,
we can now safely assume that this packet will be sent off on the outgoing
interface, and thus put the connection tracking entry which we created at
\lident{NF_IP_PRE_ROUTING} into the conntrack hash table.  This process is called
\textit{confirming the conntrack}.
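
The following is a minimal sketch in plain C of the lookup-then-confirm flow
just described; all types and helper functions here are hypothetical
stand-ins, not the kernel's actual data structures or API:
{\footnotesize
\begin{verbatim}
/* Sketch of the lookup/confirm flow; all
 * types and helpers are hypothetical.   */
#include <stdbool.h>
#include <stddef.h>

struct pkt;              /* ~ struct sk_buff */
struct ct_entry {        /* ~ ip_conntrack   */
    bool confirmed;
    /* ... original and reply tuples ...    */
};

/* hypothetical helper functions */
struct ct_entry *ct_lookup(struct pkt *p);
struct ct_entry *ct_create(struct pkt *p);
void ct_hash_insert(struct ct_entry *ct);
void pkt_set_ct(struct pkt *p,
                struct ct_entry *ct);
struct ct_entry *pkt_get_ct(struct pkt *p);

/* NF_IP_PRE_ROUTING: associate packet with
 * a (possibly new, unconfirmed) entry.    */
void example_pre_routing(struct pkt *p)
{
    struct ct_entry *ct = ct_lookup(p);
    if (!ct)
        ct = ct_create(p); /* not in hash */
    pkt_set_ct(p, ct);
}

/* NF_IP_POST_ROUTING: packet survived, so
 * "confirm" the entry into the hash.      */
void example_post_routing(struct pkt *p)
{
    struct ct_entry *ct = pkt_get_ct(p);
    if (ct && !ct->confirmed) {
        ct_hash_insert(ct);
        ct->confirmed = true;
    }
}
\end{verbatim}
}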

The connection tracking code itself is not monolithic, but consists of a
couple of separate modules\footnote{They don't actually have to be separate
kernel modules; e.g.\ TCP, UDP, and ICMP tracking modules are all part of
the Linux kernel module \ident{ip_conntrack.o}.}.  Besides the conntrack core,
there are two important kinds of modules: protocol helpers and application
helpers.

Protocol helpers implement the layer-4-protocol-specific parts.  They currently
exist for TCP, UDP, and ICMP (an experimental helper for GRE also exists).

\subsubsection{TCP connection tracking}

As TCP is a connection-oriented protocol, it is not very difficult to imagine
how connection tracking for this protocol could work.  There are well-defined
state transitions possible, and conntrack can decide which state transitions
are valid within the TCP specification.  In reality it's not all that easy,
since we cannot assume that all packets that pass the packet filter actually
arrive at the receiving end\ldots

It is noteworthy that the standard connection tracking code does \textbf{not}
do TCP sequence number and window tracking.  A well-maintained patch to add
this feature has existed for almost as long as connection tracking itself.  It
will be integrated with the 2.5.x kernel.  The problem with window tracking is
its bad interaction with connection pickup.  The TCP conntrack code is able to
pick up already existing connections, e.g.\ in case your firewall was rebooted.
However, connection pickup conflicts with TCP window tracking:  The TCP
window scaling option is only transferred at connection setup time, and we
don't know about it in case of pickup\ldots

\subsubsection{ICMP tracking}

ICMP is not really a connection-oriented protocol.  So how is it possible to
do connection tracking for ICMP?

The ICMP protocol can be split in two groups of messages:

\begin{itemize}
\item
ICMP error messages, which sort-of belong to a different connection.  They
are associated as \textit{RELATED} to that connection
(\lident{ICMP_DEST_UNREACH}, \lident{ICMP_SOURCE_QUENCH},
\lident{ICMP_TIME_EXCEEDED},
\lident{ICMP_PARAMETERPROB}, \lident{ICMP_REDIRECT}).
\item
ICMP queries, which have a \ident{request-reply} character.  The conntrack
code lets the request have a state of \textit{NEW} and the reply
\textit{ESTABLISHED}; the reply closes the connection immediately
(\lident{ICMP_ECHO}, \lident{ICMP_TIMESTAMP}, \lident{ICMP_INFO_REQUEST}, \lident{ICMP_ADDRESS}).
\end{itemize}

\subsubsection{UDP connection tracking}

UDP is designed as a connectionless datagram protocol.  But most common
protocols using UDP as their layer 4 protocol have bi-directional UDP
communication.  Imagine a DNS query, where the client sends a UDP frame to
port 53 of the nameserver, and the nameserver sends back a DNS reply packet
from its UDP port 53 to the client.

Netfilter treats this as a connection. The first packet (the DNS request) is
assigned a state of \textit{NEW}, because the packet is expected to create a new
`connection.'  The DNS server's reply packet is marked as \textit{ESTABLISHED}.

\subsubsection{conntrack application helpers}

More complex application protocols involving multiple connections need special
support by a so-called ``conntrack application helper module.''  Modules in
the stock kernel exist for FTP, IRC (DCC), TFTP, and Amanda.  Netfilter CVS
currently contains patches for PPTP, H.323, Eggdrop botnet, MMS, DirectX,
RTSP, and talk/ntalk.  We're still lacking
a lot of protocols (e.g.\ SIP, SMB/CIFS)---but they are unlikely to appear
until somebody really needs them and either develops them on his own or
funds development.

\subsubsection{Integration of connection tracking with iptables}

As stated earlier, conntrack doesn't impose any policy on packets.  It just
determines the relation of a packet to already existing connections.
To base
packet filtering decisions on this state information, the iptables \textit{state}
match can be used.  Every packet falls into one of the following categories:

\begin{itemize}
\item
\textbf{NEW}: packet would create a new connection, if it survives
\item
\textbf{ESTABLISHED}: packet is part of an already established connection 
(either direction)
\item
\textbf{RELATED}: packet is in some way related to an already established
connection, e.g.\ ICMP errors or FTP data sessions 
\item
\textbf{INVALID}: conntrack is unable to derive conntrack information
from this packet.  Please note that all multicast or broadcast packets
fall in this category. 
\end{itemize}
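
As an illustration, a typical ruleset fragment using this match accepts reply
traffic with
\texttt{iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT}.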


\subsection{Poor man's conntrack failover}

When thinking about failover of stateful firewalls, one usually thinks about
replication of state.  This presumes that the state is gathered at one
firewalling node (the currently active node), and replicated to several other
passive standby nodes.  There is, however, a very different approach to
replication:  concurrent state tracking on all firewalling nodes. 

While this scheme has not been implemented within \ident{ct_sync}, the author
still thinks it is worth an explanation in this paper.

The basic assumption of this approach is: In a setup where all firewalling
nodes receive exactly the same traffic, all nodes will deduce the same state
information.

The feasibility of this approach depends entirely on this assumption being
fulfilled.

\begin{itemize}
\item
\textit{All packets need to be seen by all nodes}.  This is not always true, but
can be achieved by using shared media like traditional ethernet (no switches!!) 
and promiscuous mode on all ethernet interfaces.
\item
\textit{All nodes need to be able to process all packets}.  This cannot be
universally guaranteed.  Even if the hardware (CPU, RAM, Chipset, NICs) and
software (Linux kernel) are exactly the same, they might behave differently,
especially under high load.  To avoid those effects, the hardware should be
able to deal with way more traffic than seen during operation.  Also, there
should be no userspace processes (like proxies, etc.) running on the firewalling
nodes at all.  WARNING: Nobody guarantees this behaviour.  However, the poor
man is usually not interested in scientific proof but in usability in his
particular practical setup.
\end{itemize}

However, even if those conditions are fulfilled, there are remaining issues:
\begin{itemize}
\item
\textit{No resynchronization after reboot}.  If a node is rebooted (because of
a hardware fault, software bug, software update, etc.), it loses all state
information gathered before the reboot.  The effects depend on the traffic:
in general, it is only assured that state information about connections
initiated after the reboot will be present.  If connections are short-lived
(like http), the state information on the just-rebooted node will gradually
approach that of the older nodes.  Only after all sessions that were active
at the time of the reboot have terminated is the state information guaranteed
to be resynchronized.
\item
\textit{Only possible with shared medium}.  The practical implication is that no
switched ethernet (and thus no full duplex) can be used.
\end{itemize}

The major advantage of the poor man's approach is implementation simplicity.
No state transfer mechanism needs to be developed.  Only very few changes
to the existing conntrack code would be needed in order to be able to
do tracking based on packets received from promiscuous interfaces.  The active
node would have packet forwarding turned on; the passive nodes would have it
turned off.

I'm not proposing this as a real solution to the failover problem.  It's
hackish, buggy, and likely to break very easily.  But considering it can be
implemented in very little programming time, it could be an option for very
small installations with low reliability criteria.

\subsection{Conntrack state replication}

The preferred solution to the failover problem is, without any doubt, 
replication of the connection tracking state.

The proposed conntrack state replication solution consists of several
parts:
\begin{itemize}
\item
A connection tracking state replication protocol
\item
An event interface generating event messages as soon as state information
changes on the active node
\item
An interface for explicit generation of connection tracking table entries on
the standby slaves
\item
Some code (preferably a kernel thread) running on the active node, receiving
state updates by the event interface and generating conntrack state replication
protocol messages
\item
Some code (preferably a kernel thread) running on the slave node(s), receiving
conntrack state replication protocol messages and updating the local conntrack
table accordingly
\end{itemize}

Flow of events in chronological order:
\begin{itemize}
\item
\textit{on active node, inside the network RX softirq} 
\begin{itemize}
\item
	\ident{ip_conntrack} analyzes a forwarded packet
\item
	\ident{ip_conntrack} gathers some new state information
\item
	\ident{ip_conntrack} updates conntrack hash table
\item
	\ident{ip_conntrack} calls event API
\item
	function registered to event API builds and enqueues message to send ring 
\end{itemize}
\item
\textit{on active node, inside the conntrack-sync sender kernel thread}
	\begin{itemize}
	\item
	\ident{ct_sync_send} aggregates multiple messages into one packet
	\item
	\ident{ct_sync_send} dequeues packet from ring
	\item
	\ident{ct_sync_send} sends packet via in-kernel sockets API
	\end{itemize}
\item
\textit{on slave node(s), inside network RX softirq}
	\begin{itemize}
	\item
	\ident{ip_conntrack} ignores packets coming from the \ident{ct_sync} interface via NOTRACK mechanism
	\item
	UDP stack appends packet to socket receive queue of \ident{ct_sync_recv} kernel thread
	\end{itemize}
\item
\textit{on slave node(s), inside conntrack-sync receive kernel thread}
	\begin{itemize}
	\item
	\ident{ct_sync_recv} thread receives state replication packet
	\item
	\ident{ct_sync_recv} thread parses packet into individual messages
	\item
	\ident{ct_sync_recv} thread creates/updates local \ident{ip_conntrack} entry
	\end{itemize}
\end{itemize}


\subsubsection{Connection tracking state replication protocol}


  In order to be able to replicate the state between two or more firewalls, a
state replication protocol is needed.  This protocol is used over a private
network segment shared by all nodes for state replication.  It is designed to
work over IP unicast and IP multicast transport.  IP unicast will be used for
direct point-to-point communication between one active firewall and one
standby firewall.  IP multicast will be used when the state needs to be
replicated to more than one standby firewall.


  The principal design criteria of this protocol are:
\begin{itemize}
\item
	\textbf{reliable against data loss}, as the underlying UDP layer only
	provides checksumming against data corruption, but doesn't employ any
	means against data loss
\item
	\textbf{lightweight}, since generating the state update messages is
	already a very expensive process for the sender, eating additional CPU,
memory, and I/O bandwidth.
\item
	\textbf{easy to parse}, to minimize overhead at the receiver(s)
\end{itemize}

The protocol does not employ any security mechanism such as encryption,
authentication, or protection against spoofing attacks.  It is
assumed that the private conntrack sync network is a secure communications
channel, not accessible to any malicious third party.

To achieve reliability against data loss, a simple sequence numbering
scheme is used.  All protocol messages are prefixed by a sequence number
determined by the sender.  If the slave detects packet loss by discontinuous
sequence numbers, it can request the retransmission of the missing packets
by stating the missing sequence number(s).  Since there is no acknowledgement
for successfully received packets, the sender has to keep a
reasonably-sized\footnote{\textit{reasonable size} must be large enough to
cover the round-trip time between the master and the slowest slave.} backlog
of recently-sent packets in order to be able to fulfill retransmission
requests.

The different state replication protocol packet types are:
\begin{itemize}
\item
\textbf{\ident{CT_SYNC_PKT_MASTER_ANNOUNCE}}: A new master announces itself.
Any still existing master will downgrade itself to slave upon
reception of this packet. 
\item
\textbf{\ident{CT_SYNC_PKT_SLAVE_INITSYNC}}: A slave requests initial
synchronization from the master (after reboot or loss of sync). 
\item
\textbf{\ident{CT_SYNC_PKT_SYNC}}: A packet containing synchronization data
from master to slaves 
\item
\textbf{\ident{CT_SYNC_PKT_NACK}}: A slave indicates packet loss of a
particular sequence number 
\end{itemize}

The messages within a \lident{CT_SYNC_PKT_SYNC} packet always refer to a particular
\textit{resource} (currently \lident{CT_SYNC_RES_CONNTRACK} and \lident{CT_SYNC_RES_EXPECT},
although support for the latter has not been fully implemented yet).  

For every resource, there are several message types.  So far, only
\lident{CT_SYNC_MSG_UPDATE} and \lident{CT_SYNC_MSG_DELETE} have been implemented.  This
means a new connection as well as state changes to an existing connection will
always be encapsulated in a \lident{CT_SYNC_MSG_UPDATE} message and therefore contain
the full conntrack entry.

To uniquely identify (and later reference) a conntrack entry, the only unique
criterion is used: the \ident{ip_conntrack_tuple}.
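
For illustration only (this is a rough sketch, not the actual \ident{ct_sync}
wire format; field names, sizes, and ordering are assumptions), the packet and
message framing described above could be expressed in C as:
{\footnotesize
\begin{verbatim}
/* Illustrative framing only -- NOT the
 * actual ct_sync on-wire format.        */
#include <stdint.h>

struct example_pkt_hdr {
    uint32_t seq;      /* sequence number  */
    uint8_t  node_id;  /* sending node     */
    uint8_t  pkt_type; /* ..._PKT_SYNC etc */
    uint16_t len;      /* total length     */
};

struct example_msg_hdr {
    uint16_t len;      /* message length   */
    uint8_t  res;      /* CONNTRACK/EXPECT */
    uint8_t  type;     /* UPDATE / DELETE  */
    /* payload: full conntrack entry for
     * UPDATE, the tuple for DELETE        */
};
\end{verbatim}
}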

\subsubsection{\texttt{ct\_sync} sender thread}

Maximum care needs to be taken with the implementation of the \ident{ct_sync} sender.

The normal workload of the active firewall node is likely to be already very
high, so generating and sending the conntrack state replication messages needs
to be highly efficient.

It was therefore decided to use a pre-allocated ringbuffer for outbound
\ident{ct_sync} packets.  New messages are appended to individual buffers in this
ring, and pointers into this ring are passed to the in-kernel sockets API to
ensure a minimum number of copies and memory allocations.
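
The following is a minimal sketch of such a pre-allocated ring (single
producer, single consumer; locking, the buffer layout, and all names are
simplified assumptions, not \ident{ct_sync}'s actual code):
{\footnotesize
\begin{verbatim}
/* Illustrative only; not ct_sync's code. */
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 1024u  /* power of two  */
#define BUF_SIZE  1400u  /* below MTU     */

struct ring_buf {
    uint32_t len;            /* bytes used */
    uint8_t  data[BUF_SIZE]; /* messages   */
};

/* allocated once, at module init time */
static struct ring_buf ring[RING_SIZE];
static unsigned int head, tail;

/* producer (event callback): next free
 * buffer, no allocation in the fast path */
static struct ring_buf *ring_produce(void)
{
    if (head - tail == RING_SIZE)
        return NULL;         /* ring full  */
    return &ring[head++ % RING_SIZE];
}

/* consumer (sender thread): oldest buffer
 * not yet handed to the sockets API      */
static struct ring_buf *ring_consume(void)
{
    if (tail == head)
        return NULL;         /* empty      */
    return &ring[tail++ % RING_SIZE];
}
\end{verbatim}
}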

\subsubsection{\texttt{ct\_sync} initsync sender thread}

In order to facilitate ongoing state synchronization while at the same time
responding to initial sync requests of an individual slave, the sender has a
separate kernel thread for initial state synchronization (\ident{ct_sync_initsync}).

At the moment it iterates over the state table and transmits packets with a
fixed rate of about 1000 packets per second, resulting in about 4000
connections per second, averaging to about 1.5 Mbps of bandwidth consumed.

The speed of this initial sync should be configurable by the system
administrator, especially since there is no flow control mechanism, and the
slave node(s) will have to deal with the packets or otherwise lose sync again.

This is certainly an area of future improvement and development---but first we
want to see practical problems with this primitive scheme.

\subsubsection{\texttt{ct\_sync} receiver thread}

Implementation of the receiver is very straightforward.

For performance reasons, and to facilitate code-reuse, the receiver uses the
same pre-allocated ring buffer structure as the sender.  Incoming packets are
written into ring members and then successively parsed into their individual
messages.

Apart from dealing with lost packets, it just needs to call the
respective conntrack add/modify/delete functions.

\subsubsection{Necessary changes within netfilter conntrack core}

To be able to achieve the described conntrack state replication mechanism,
the following changes to the conntrack core were implemented:
\begin{itemize}
\item
	Ability to exclude certain packets from being tracked.  This was a
	long-wanted feature on the TODO list of the netfilter project and is
	implemented by having a ``raw'' table in combination with a
	``NOTRACK'' target.
\item
	Ability to register callback functions to be called every time a new
	conntrack entry is created or an existing entry modified.  This is
	part of the nfnetlink-ctnetlink patch, since the ctnetlink event
	interface also uses this API.
\item
	Export an API to externally add, modify, and remove conntrack entries.  
\end{itemize}

Since the number of changes is very low, their inclusion into the mainline
kernel is not a problem and can happen during the 2.6.x stable kernel series.
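
As a hedged sketch of how such a conntrack event callback could be hooked up
by a \ident{ct_sync}-like module (all names below are illustrative
assumptions, not the actual API of the ctnetlink patch):
{\footnotesize
\begin{verbatim}
/* Illustrative names only; not the real
 * conntrack notifier API.               */
struct ip_conntrack;        /* opaque     */

typedef void (*ct_event_cb_t)(unsigned int
    events, struct ip_conntrack *ct);

/* hypothetical registration function */
int example_register_ct_cb(ct_event_cb_t cb);

/* Runs in softirq context on the active
 * node: serialize the entry into an
 * UPDATE/DELETE message and append it to
 * the send ring; must not sleep.        */
static void ct_sync_event(unsigned int ev,
        struct ip_conntrack *ct)
{
    (void)ev; (void)ct;
    /* build message, enqueue to ring */
}

static int ct_sync_init(void)
{
    return example_register_ct_cb(
        ct_sync_event);
}
\end{verbatim}
}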


\subsubsection{Layer 2 dropping and \texttt{ct\_sync}}

In most cases, netfilter/iptables-based firewalls will not only function as
a packet filter but also run local processes such as proxies, DNS relays, SMTP
relays, etc.

In order to minimize failover time, it is helpful if the full startup and
configuration of all network interfaces and all of those userspace processes
can happen at system bootup time rather than at the instant of a failover.

l2drop provides a convenient way to achieve this:  It hooks into the layer 2
netfilter hooks (immediately attached to \ident{netif_rx()} and 
\ident{dev_queue_xmit}) and blocks all incoming and outgoing network packets at this
very low layer.  Even kernel-generated messages such as ARP replies, IPv6
neighbour discovery, IGMP, \dots are blocked this way.

Of course there has to be an exemption for the state synchronization messages
themselves.  In order to still facilitate remote administration via SSH and
other communication between the cluster nodes, the whole network
interface used for synchronization is exempted from
l2drop.

As soon as a node is promoted to master state, l2drop is disabled and the
system becomes visible to the network.


\subsubsection{Configuration}

All configuration happens via module parameters.

\begin{itemize}
\item
	\texttt{syncdev}: Name of the multicast-capable network device
	used for state synchronization among the nodes 
\item
	\texttt{state}: Initial state of the node (0=slave, 1=master)
\item
	\texttt{id}: Unique Node ID (0..255)
\item
	\texttt{l2drop}: Enable (1) or disable (0) the l2drop functionality
\end{itemize}
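
For example, a slave node with a dedicated synchronization interface (the
interface name and ID here are purely illustrative) might load the module as
\brcode{modprobe ct_sync syncdev=eth1 state=0 id=2 l2drop=1}.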
	
\subsubsection{Interfacing with the cluster manager}

As indicated at the beginning of this paper, \ident{ct_sync} itself does not provide
any mechanism to determine outage of the master node within a cluster.  This
job is left to a cluster manager software running in userspace.

Once an outage of the master is detected, the cluster manager needs to elect
one of the remaining (slave) nodes to become new master.  On this elected node,
the cluster manager will write the ASCII character \texttt{1} into the 
\ident{/proc/net/ct_sync} file.  Reading from this file will return the current state
of the local node.
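
A trivial userspace sketch of this promotion step, assuming only the
\ident{/proc/net/ct_sync} interface described above:
{\footnotesize
\begin{verbatim}
/* Promote the local node to master by
 * writing ASCII '1' to /proc/net/ct_sync */
#include <stdio.h>

int promote_to_master(void)
{
    FILE *f = fopen("/proc/net/ct_sync", "w");
    if (f == NULL)
        return -1;
    if (fputc('1', f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f);
}
\end{verbatim}
}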

\section{Acknowledgements}

The author would like to thank his fellow netfilter developers for their
help.  Particularly important to \ident{ct_sync} is Krisztian KOVACS
\ident{<hidden@balabit.hu>}, who did a proof-of-concept implementation based on my
first paper on \ident{ct_sync} at OLS2002.

Without the financial support of Astaro AG, I would not have been able to spend any
time on \ident{ct_sync} at all.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}
