% The file must begin with this \documentclass declaration. You can 
% give one of three different options which control how picky LaTeX 
% is when typesetting:
%
% galley - All ``this doesn't fit'' warnings are suppressed, and 
%          references are disabled (the key will be printed as a
%          reminder).  Use this mode while writing.
%
% proof -  All ``this doesn't fit'' warnings are active, as are
%          references.  Overfull hboxes make ugly black blobs in
%          the margin.  Use this mode to tidy up formatting after
%          you're done writing.  (Same as article's ``draft'' mode.)
%
% final -  As proof, but the ugly black blobs are turned off.  Use
%          this to render PDFs or PostScript to give to other people,
%          when you're completely done.  (As with article, this is the
%          default.)
%
% You can also use the leqno, fleqn, or openbib options to article.cls
% if you wish.  None of article's other options will work.

%%%
%%% PLEASE CHANGE 'galley' to 'final' BEFORE SUBMITTING.  THANKS!
%%% (to submit: "make clean" in the toplevel directory; tar and gzip *only* your directory;
%%% email the gzipped tarball to papers@linuxsymposium.org.)
%%%
\documentclass[final]{ols}

% These two packages allow easy handling of urls and identifiers per the example paper.
\usepackage{url}
\usepackage{zrl}

% The following package is not required, but is a handy way to put PDF and EPS graphics
% into your paper using the \includegraphics command.
\ifpdf
\usepackage[pdftex]{graphicx}
\else
\usepackage{graphicx}
\fi


% Here in the preamble, you may load additional packages, or
% define whatever macros you like, with the following exceptions:
%
% - Do not mess with the page layout, either by hand or with packages
%   (e.g., typearea, geometry).
% - Do not change the principal fonts, either by hand or with packages.
% - Do not use \pagestyle, or load any page header-related packages.
% - Do not redefine any commands having to do with article titles.
% - If you are using something that is not part of the standard
%   tetex-2 distribution, please make a note of whether it's on CTAN,
%   or include a copy with your submission.
%

\begin{document}

% Mandatory: article title specification.
% Do not put line breaks or other clever formatting in \title or
% \shortauthor; these are moving arguments.

\title{Flow-based network accounting with Linux}
\subtitle{ }  % Subtitle is optional.
\date{}             % You can put a fixed date in if you wish,
                    % allow LaTeX to use the date of typesetting,
                    % or use \date{} to have no date at all.
                    % Whatever you do, there will not be a date
                    % shown in the proceedings.

\shortauthor{Harald Welte}  % Just you and your coauthors' names.
% for example, \shortauthor{A.N.\ Author and A.\ Nother}
% or perchance \shortauthor{Smith, Jones, Black, White, Gray, \& Greene}

\author{%  Authors, affiliations, and email addresses go here, like this:
Harald Welte \\
{\itshape netfilter core team / hmw-consulting.de / Astaro AG} \\
{\ttfamily\normalsize laforge@netfilter.org}\\
% \and
% Bob \\
% {\itshape Bob's affiliation.}\\
% {\ttfamily\normalsize bob@example.com}\\
} % end author section

\maketitle

\begin{abstract}
% Article abstract goes here.
\input{welte-abstract.tex}
\end{abstract}

% Body of your article goes here.  You are mostly unrestricted in what
% LaTeX features you can use; however, the following will not work:
% \thispagestyle
% \marginpar
% table of contents
% list of figures / tables
% glossaries
% indices

\section{Network accounting}

Network accounting generally describes the process of counting and potentially
summarizing metadata of network traffic.  The kind of metadata is largely
dependent on the particular application, but usually includes data such as the
number of packets, the number of bytes, and the source and destination IP
addresses.

There are many reasons for doing accounting of network traffic, among them:

\begin{itemize}
\item transfer volume or bandwidth based billing
\item monitoring of network utilization, bandwidth distribution and link usage
\item research, such as distribution of traffic among protocols, average packet size, ...
\end{itemize}

\section{Existing accounting solutions for Linux}

There are a number of existing packages for doing network accounting with Linux.
The following subsections give a short overview of the most commonly used
ones.


\subsection{nacctd}

\ident{nacctd}, also known as \ident{net-acct}, is probably the oldest tool
for network accounting under Linux (it also works on other Unix-like operating
systems).  The author of this paper has used
\ident{nacctd} as an accounting tool as early as 1995.  It was originally
developed by Ulrich Callmeier, but apparently abandoned later on.  Development
seems to have continued in multiple branches, one of them being
the netacct-mysql\footnote{http://netacct-mysql.gabrovo.com} branch,
currently at version 0.79rc2.

Its principle of operation is to use an \lident{AF_PACKET} socket
via \ident{libpcap} in order to capture copies of all packets on configurable
network interfaces.  It then does TCP/IP header parsing on each packet.
Summary information such as port numbers, IP addresses and number of bytes is
then stored in an internal table for aggregation of successive packets of the
same flow.  The table entries are eventually evicted and written to a
human-readable ASCII file.  Patches exist for sending information directly
into SQL databases, or for saving data in a machine-readable format.

As a pcap-based solution, it suffers from the performance penalty of copying
every full packet to userspace.  As a packet-based solution, it suffers from
the penalty of having to interpret every single packet.
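
The general approach of such pcap-based accounting tools can be illustrated by
the following minimal sketch (it is not taken from the \ident{nacctd} sources,
and the interface name and flow table handling are deliberately simplified):
every packet is copied to userspace, its IP header is parsed, and a per-flow
byte counter is updated.

\begin{verbatim}
/* Simplified pcap-based accounting sketch (not
 * from nacctd).  Build: gcc acct.c -lpcap */
#include <pcap.h>
#include <netinet/ip.h>
#include <arpa/inet.h>
#include <stdio.h>

struct flow { u_int32_t src, dst; u_int64_t bytes; };
static struct flow table[1024];

static void account(u_int32_t src, u_int32_t dst,
                    u_int16_t len)
{
        int i;

        for (i = 0; i < 1024; i++) {
                if (table[i].bytes != 0 &&
                    (table[i].src != src ||
                     table[i].dst != dst))
                        continue;
                table[i].src = src;
                table[i].dst = dst;
                table[i].bytes += len;
                return;
        }
        /* table full: a real tool would evict here */
}

static void handler(u_char *user,
                    const struct pcap_pkthdr *h,
                    const u_char *pkt)
{
        /* assume Ethernet framing: IP at offset 14 */
        const struct iphdr *iph =
                (const struct iphdr *)(pkt + 14);

        if (h->caplen < 14 + sizeof(*iph))
                return;
        account(iph->saddr, iph->daddr,
                ntohs(iph->tot_len));
}

int main(void)
{
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 96, 1,
                                   1000, errbuf);

        if (!p) {
                fprintf(stderr, "pcap: %s\n", errbuf);
                return 1;
        }
        pcap_loop(p, -1, handler, NULL);
        return 0;
}
\end{verbatim}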

\subsection{ipt\_LOG based}

The Linux packet filtering subsystem iptables offers a way to log policy
violations via the kernel message ring buffer. This mechanism is called
\ident{ipt_LOG} (or \texttt{LOG target}).  Such messages are then further
processed by \ident{klogd} and \ident{syslogd}, which put them into one or
multiple system log files.

As \ident{ipt_LOG} was designed for logging policy violations and not for
accounting, its overhead is significant.  Every packet needs to be
interpreted in-kernel, then printed in ASCII format to the kernel message ring
buffer, then copied from klogd to syslogd, and again copied into a text file.
Even worse, most syslog installations are configured to write kernel log
messages synchronously to disk, bypassing the usual write buffering of the
block I/O layer and disk subsystem.

To sum up and analyze the data, custom Perl scripts are often used.  These
scripts have to parse the LOG lines, build up a table of flows, add up the
packet size fields and finally export the data in the desired format.  Due to
the inefficient storage format, performance is again wasted at analysis time.

\subsection{ipt\_ULOG based (ulogd, ulog-acctd)}

The iptables \texttt{ULOG target} is a more efficient version of
the \texttt{LOG target} described above.  Instead of copying ASCII messages via
the kernel ring buffer, it can be configured to copy only the headers of each
packet and to send those copies to userspace in large batches.  A special
userspace process, normally \ident{ulogd}, receives those partial packet
copies and does further interpretation.

\ident{ulogd}\footnote{http://gnumonks.org/projects/ulogd} is intended for
logging of security violations and thus resembles the functionality of LOG.  It
creates one logfile entry per packet.  It supports logging in many formats,
such as SQL databases or PCAP format.

\ident{ulog-acctd}\footnote{http://alioth.debian.org/projects/pkg-ulog-acctd/}
is a hybrid between \ident{ulogd} and \ident{nacctd}.  It replaces the
\ident{nacctd} libpcap/PF\_PACKET based capture with the more efficient
ULOG mechanism.

Compared to \ident{ipt_LOG}, \ident{ipt_ULOG} reduces the amount of copied data
and the number of required kernel/userspace context switches and thus improves
performance.  However, the whole mechanism is still intended for logging of
security violations; using it for accounting is outside its design.
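
For illustration, a receiver built on \ident{libipulog} (the small library
that ships with \ident{ulogd}) looks roughly like the following sketch.  The
function names follow the library as distributed with ulogd 1.x and should be
checked against the installed headers.

\begin{verbatim}
/* ULOG receiver sketch using libipulog as shipped
 * with ulogd 1.x; verify names against libipulog.h */
#include <sys/types.h>
#include <libipulog/libipulog.h>
#include <stdio.h>

int main(void)
{
        unsigned char buf[65536];
        struct ipulog_handle *h;
        ulog_packet_msg_t *upkt;
        ssize_t len;

        /* netlink multicast group 1, matching e.g.
         * iptables ... -j ULOG --ulog-nlgroup 1 */
        h = ipulog_create_handle(
                ipulog_group2gmask(1), 150000);
        if (!h) {
                ipulog_perror("create_handle");
                return 1;
        }
        for (;;) {
                len = ipulog_read(h, buf,
                                  sizeof(buf), 1);
                if (len <= 0)
                        continue;
                /* one read may carry several packets */
                while ((upkt = ipulog_get_packet(h, buf,
                                                 len)))
                        printf("%lu bytes on %s\n",
                               (unsigned long)
                                       upkt->data_len,
                               upkt->indev_name);
        }
        ipulog_destroy_handle(h);
        return 0;
}
\end{verbatim}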

\subsection{iptables based (ipac-ng)}

Every packet filtering rule in the Linux packet filter (\ident{iptables}, or
even its predecessor \ident{ipchains}) has two counters: number of packets and
number of bytes matching this particular rule.

By carefully placing rules with no target (so-called \textit{fallthrough}
rules) in the packet filter ruleset, one can implement an accounting setup,
e.g. one rule per customer.

A number of tools exist to parse the iptables command output and summarize the
counters.  The most commonly used package is
\ident{ipac-ng}\footnote{http://sourceforge.net/projects/ipac-ng/}.  It
supports advanced features such as storing accounting data in SQL databases.

The approach only works efficiently for small installations (i.e. a small
number of accounting rules).  The accounting granularity therefore has to be
very coarse; one counter for every single port number at any given IP address
is certainly not feasible.
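
Tools of this kind essentially add one fallthrough rule per accounting object
(e.g. \texttt{iptables -A ACCOUNTING -s 192.168.1.5}) and then periodically
read the exact counters with \texttt{iptables -L -v -n -x}.  The following
minimal sketch shows that second step; the chain name \texttt{ACCOUNTING} is
only an example, and real packages like \ident{ipac-ng} are of course far
more elaborate.

\begin{verbatim}
/* Counter scraper in the spirit of ipac-ng: read
 * per-rule packet/byte counters from iptables. */
#include <stdio.h>

int main(void)
{
        FILE *f;
        char line[512];
        unsigned long long pkts, bytes;

        f = popen("iptables -L ACCOUNTING -v -n -x",
                  "r");
        if (!f) {
                perror("popen");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* rule lines start with the packet
                 * and byte counters; header lines
                 * do not match */
                if (sscanf(line, "%llu %llu",
                           &pkts, &bytes) == 2)
                        printf("%llu bytes: %s",
                               bytes, line);
        }
        pclose(f);
        return 0;
}
\end{verbatim}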

\subsection{ipt\_ACCOUNT (iptaccount)}

\ident{ipt_ACCOUNT}\footnote{http://www.intra2net.com/opensource/ipt\_account/}
is a special-purpose iptables target developed by Intra2net AG and available
from the netfilter project patch-o-matic-ng repository.  It requires kernel
patching and is not included in the mainline kernel.

\ident{ipt_ACCOUNT} keeps byte counters per IP address in a given subnet, up to
a '/8' network.  Those counters can be read via a special \ident{iptaccount}
commandline tool.

Being limited to local network segments of at most '/8' size, and only having
per-IP granularity, are two limitations that rule out \ident{ipt_ACCOUNT}
as a generic accounting mechanism.  It is highly optimized, but also
special-purpose.

\subsection{ntop (including PF\_RING)}

\ident{ntop}\footnote{http://www.ntop.org/ntop.html} is a network traffic
probe to show network usage.  It uses \ident{libpcap} to capture
the packets, and then aggregates flows in userspace.  On a fundamental level
it's therefore similar to what \ident{nacctd} does.

From the ntop project, there's also \ident{nProbe}, a network traffic probe
that exports flow based information in Cisco NETFLOW v5/v9 format.  It also
contains support for the upcoming IETF IPFIX\footnote{IP Flow Information
Export http://www.ietf.org/html.charters/ipfix-charter.html} format.

To increase performance of the probe, the author (Luca Deri) has implemented
\lident{PF_RING}\footnote{http://www.ntop.org/PF\_RING.html}, a new
zero-copy mmap()ed implementation for packet capture.  There is a libpcap
compatibility layer on top, so any pcap-using application can benefit from
\lident{PF_RING}.

\lident{PF_RING} is a major performance improvement; for details, please refer
to the documentation and the paper published by Luca Deri.

However, \ident{ntop} / \ident{nProbe} / \lident{PF_RING} are all packet-based
accounting solutions.  Every packet needs to be analyzed by some userspace
process - even if there is no copying involved.  Due to the \lident{PF_RING}
optimization, it is probably as efficient as this approach can get.

\section{New ip\_conntrack based accounting}

The fundamental idea is to (ab)use the connection tracking subsystem of the
Linux 2.4.x / 2.6.x kernel for accounting purposes.  There are several reasons
why this is a good fit:
\begin{itemize}
\item It already keeps per-connection state information. Extending this information to contain a set of counters is easy.
\item Lots of routers/firewalls are already running it, and are therefore already paying its performance penalty for security reasons.  Bumping a couple of counters introduces very little additional penalty.
\item There was already an (out-of-tree) mechanism to dump connection tracking information to userspace, called \ident{ctnetlink}.
\end{itemize}

So given that a particular machine is already running \ident{ip_conntrack},
adding flow-based accounting to it comes almost for free.  The author does not
advocate the use of \ident{ip_conntrack} merely for accounting, since that
would again be a waste of performance.

\subsection{ip\_conntrack\_acct}

\ident{ip_conntrack_acct} is the name of the in-kernel
\ident{ip_conntrack} counters.  There is a set of four
counters per connection: the numbers of packets and bytes for the original
and the reply direction of that connection.

If you configure a recent (>= 2.6.9) kernel, it will prompt you for
\lident{CONFIG_IP_NF_CT_ACCT}.  Enabling this configuration option adds the
per-connection counters and compiles in the accounting code.
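
In simplified form, the option adds something like the following to every
conntrack entry and to the per-packet connection tracking path.  The fragment
only illustrates the general shape of the 2.6.x code; field names and exact
types are not quoted literally from the kernel source.

\begin{verbatim}
/* simplified sketch, not the literal kernel code */
struct ip_conntrack_counter {
        u_int64_t packets;
        u_int64_t bytes;
};

struct ip_conntrack {
        /* ... existing conntrack state ... */
        struct ip_conntrack_counter
                counters[IP_CT_DIR_MAX];
};

/* called for every packet of a tracked connection */
static inline void
ct_account(struct ip_conntrack *ct,
           enum ip_conntrack_dir dir,
           const struct sk_buff *skb)
{
        ct->counters[dir].packets++;
        ct->counters[dir].bytes +=
                ntohs(skb->nh.iph->tot_len);
}
\end{verbatim}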

However, there is still no efficient means of reading out those counters.  They
can be accessed via \textit{cat /proc/net/ip\_conntrack}, but that is not a real
solution: the kernel iterates over all connections and ASCII-formats the data.
Also, it is a polling-based mechanism.  If the polling interval is too long,
connections might get evicted from the state table before their final counters
are read.  If the interval is too short, performance will suffer.

To counter this problem, a combination of conntrack notifiers and \ident{ctnetlink} is used.

\subsection{conntrack notifiers}

Conntrack notifiers use the core kernel notifier infrastructure
(\texttt{struct notifier\_block}) to notify other parts of the
kernel about connection tracking events.  Such events include creation,
deletion and modification of connection tracking entries.
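
As an illustration, a kernel module built against a tree with the conntrack
event patches applied could register for such events roughly as follows.  The
interface shown (\ident{ip_conntrack_register_notifier} and the \texttt{IPCT}
event bits) is the one used by the patch-o-matic-ng code and may differ in
detail from whatever finally gets merged.

\begin{verbatim}
/* sketch of a conntrack event notifier user;
 * requires the conntrack event patches */
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/netfilter_ipv4/ip_conntrack.h>

static int acct_event(struct notifier_block *nb,
                      unsigned long events, void *ptr)
{
        /* ptr points to the struct ip_conntrack
         * that the event refers to */
        if (events & IPCT_DESTROY) {
                /* final counters of this connection
                 * are now valid and could be
                 * exported from here */
        }
        return NOTIFY_DONE;
}

static struct notifier_block acct_nb = {
        .notifier_call = acct_event,
};

static int __init acct_init(void)
{
        return ip_conntrack_register_notifier(&acct_nb);
}

static void __exit acct_exit(void)
{
        ip_conntrack_unregister_notifier(&acct_nb);
}

module_init(acct_init);
module_exit(acct_exit);
MODULE_LICENSE("GPL");
\end{verbatim}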

The \texttt{conntrack notifiers} can help us overcome the polling architecture.
By listening only to \textit{conntrack delete} events, we always get
the byte and packet counters at the end of a connection.

However, the events are in-kernel events and therefore not directly suitable
for an accounting application to be run in userspace.

\subsection{ctnetlink}

\ident{ctnetlink} (short form for conntrack netlink) is a
mechanism for passing connection tracking state information between kernel and
userspace, originally developed by Jay Schulist and Harald Welte.   As the name
implies, it uses Linux \lident{AF_NETLINK} sockets as its underlying
communication facility.

The focus of \ident{ctnetlink} is to selectively read or dump
entries from the connection tracking table to userspace.  It also allows
userspace processes to delete and create conntrack entries as well as
\textit{conntrack expectations}.

The initial nature of \ident{ctnetlink} is therefore again
polling-based: a userspace process sends a request for certain information,
and the kernel responds with the requested information.

By combining \texttt{conntrack notifiers} with \ident{ctnetlink}, it is possible
to register a notifier handler that in turn sends
\ident{ctnetlink} event messages down the \lident{AF_NETLINK} socket.

A userspace process can now listen for such \textit{DELETE} event messages at
the socket, and put the counters into its accounting storage.

There are still some shortcomings inherent to that \textit{DELETE} event
scheme: we only know the amount of traffic once the connection has ended.  If a
connection lasts for a long time (say, days or weeks), then it is impossible
to use this form of accounting for any kind of quota-based billing, where the
user would be informed (or disconnected, traffic shaped, ...) when he
exceeds his quota.  Also, the conntrack entry does not contain information
about when the connection started; only the timestamp of the end of the
connection is known.

To overcome the first limitation, the accounting process can use a combined
event and polling scheme.  The granularity of accounting can therefore be
configured by the polling interval, and a compromise between performance and
accuracy can be made.

To overcome the second limitation, the accounting process can also listen for
\textit{NEW} event messages.  By correlating the \textit{NEW} and
\textit{DELETE} messages of a connection, accounting datasets containing both
start and end of the connection can be built.
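
A very much reduced sketch of such a listener is shown below.  It only opens an
\lident{AF_NETLINK} socket on the netfilter netlink family and subscribes to
the conntrack event groups; a real implementation (such as \ident{ulogd2} or
the \texttt{conntrack} tool) additionally parses the nfnetlink attributes
carried in every message.  The constants are those used by the current
nfnetlink/ctnetlink patches and may still change before mainline inclusion.

\begin{verbatim}
/* reduced ctnetlink event listener; attribute
 * parsing omitted, constants as in the current
 * nfnetlink patches */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef NETLINK_NETFILTER
#define NETLINK_NETFILTER 12
#endif
/* event group bits of the conntrack netlink code */
#define NF_NETLINK_CONNTRACK_NEW     0x00000001
#define NF_NETLINK_CONNTRACK_DESTROY 0x00000004

int main(void)
{
        char buf[8192];
        struct sockaddr_nl addr;
        int fd, len;

        fd = socket(AF_NETLINK, SOCK_RAW,
                    NETLINK_NETFILTER);
        if (fd < 0) {
                perror("socket");
                return 1;
        }
        memset(&addr, 0, sizeof(addr));
        addr.nl_family = AF_NETLINK;
        addr.nl_groups = NF_NETLINK_CONNTRACK_NEW |
                         NF_NETLINK_CONNTRACK_DESTROY;
        if (bind(fd, (struct sockaddr *)&addr,
                 sizeof(addr)) < 0) {
                perror("bind");
                return 1;
        }
        for (;;) {
                len = recv(fd, buf, sizeof(buf), 0);
                if (len < 0)
                        break;
                /* each message is a ctnetlink event;
                 * walk struct nlmsghdr and the
                 * conntrack attributes here */
                printf("event of %d bytes\n", len);
        }
        return 0;
}
\end{verbatim}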

\subsection{ulogd2}

As described earlier in this paper, \ident{ulogd} is a userspace
packet filter logging daemon that is already used for packet-based accounting,
even if it isn't the best fit.

\ident{ulogd2}, also developed by the author of this paper, takes logging
beyond per-packet information and also includes support for
per-connection or per-flow data.

Instead of supporting only \ident{ipt_ULOG} input together with a number of
interpreter and output plugins, \ident{ulogd2} supports a concept
called \textit{plugin stacks}.  Multiple stacks can exist within one daemon.
Each stack consists of plugins.  A plugin can be a source, a sink, or a
filter.

Sources acquire per-packet or per-connection data from
\ident{ipt_ULOG} or \ident{ip_conntrack_acct}.

Filters allow the user to filter or aggregate information.  Filtering is
required, since there is no way to filter the ctnetlink event messages within
the kernel; the functionality is either enabled or disabled as a whole.
Multiple connections can be aggregated into a larger, encompassing flow.
Packets could be aggregated into flows (like \ident{nacctd} does), and flows
can be aggregated into even larger flows.

Sink plugins store the resulting data in some form of non-volatile storage,
such as SQL databases, or binary or ASCII files.  Another sink is a NETFLOW or
IPFIX sink, exporting information in an industry-standard format for flow-based
accounting.
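
Conceptually, a stack can be pictured as a chain of plugin instances handing
accounting records from one to the next.  The following fragment is purely
illustrative and is \textit{not} the actual \ident{ulogd2} plugin interface; it
merely shows the idea of chaining a source, a filter and a sink.

\begin{verbatim}
/* illustrative only - NOT the real ulogd2 API */
#include <stdio.h>

struct acct_record {          /* data handed along */
        unsigned int src, dst;
        unsigned long long bytes;
};

struct plugin {
        const char *name;
        /* return 0 to pass the record on, <0 to drop */
        int (*process)(struct acct_record *rec);
        struct plugin *next;  /* next in the stack */
};

static int filter_small(struct acct_record *rec)
{
        return rec->bytes < 1024 ? -1 : 0;
}

static int sink_print(struct acct_record *rec)
{
        printf("%u -> %u: %llu bytes\n",
               rec->src, rec->dst, rec->bytes);
        return 0;
}

static struct plugin sink =
        { "print", sink_print, NULL };
static struct plugin filter =
        { "min1k", filter_small, &sink };

/* a source plugin (e.g. a ctnetlink reader) would
 * call this for every flow it has collected */
static void stack_run(struct plugin *p,
                      struct acct_record *rec)
{
        for (; p; p = p->next)
                if (p->process(rec) < 0)
                        return;
}

int main(void)
{
        struct acct_record rec = { 1, 2, 4096 };

        stack_run(&filter, &rec);
        return 0;
}
\end{verbatim}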

\subsection{Status of implementation}

\ident{ip_conntrack_acct} has been in the mainline kernel since 2.6.9.

\ident{ctnetlink} and the \texttt{conntrack event notifiers} are considered
stable and will be submitted for mainline inclusion soon.  Both are available
from the patch-o-matic-ng repository of the netfilter project.

At the time of writing this paper, \ident{ulogd2} development
was not yet finished.  However, the ctnetlink event messages can already be
dumped using the \texttt{conntrack} userspace program, available from the
netfilter project.

The "conntrack" prorgram can listen to the netlink event socket and dump the
information in human-readable form (one ASCII line per ctnetlink message) to
stdout.  Custom accounting solutions can read this information from stdin,
parse and process it according to their needs.

\section{Summary}

Despite the large number of available accounting tools, the author is confident that inventing yet another one is worthwhile.

Many existing implementations suffer from performance issues by design.  Most
of them are very special-purpose.  nProbe/ntop together with \lident{PF_RING}
are probably the most universal and efficient solution for any accounting
problem.

Still, the new \ident{ip_conntrack_acct} and \ident{ctnetlink} based mechanism
described in this paper has a clear performance advantage if you want to do
accounting on your Linux-based stateful packet filter - which is a common
case.  The firewall is supposed to be at the edge of your network, exactly where
you usually do accounting of ingress and/or egress traffic.

\end{document}
