1 files changed, 426 insertions, 0 deletions
diff --git a/2005/flow-accounting-lt2005/ltpdk/paper/paper-11076.xml b/2005/flow-accounting-lt2005/ltpdk/paper/paper-11076.xml
new file mode 100644
index 0000000..a14546f
--- /dev/null
+++ b/2005/flow-accounting-lt2005/ltpdk/paper/paper-11076.xml
@@ -0,0 +1,426 @@
+<?xml version="1.0" encoding="ISO-8859-1"?>
+<article id="paper-11076">
+  <articleinfo>
+    <title>Flow based network accounting with Linux</title>
+    <author>
+      <firstname>Harald</firstname>
+      <surname>Welte</surname>
+    </author>
+    <copyright>
+      <year>2005</year>
+      <holder>Harald Welte</holder>
+    </copyright>
+  </articleinfo>
+
+<section>
+<title>Abstract</title>
+<para>
+Many networking scenarios require some form of network accounting that goes beyond the simple packet and byte counters available from the 'ifconfig' output.
+</para>
+<para>
+Network accounting can generally be done in a number of different ways.  The
+traditional way is to capture all packets by some userspace program.  Capturing
+can be done via a number of mechanisms such as <emphasis>PF_PACKET</emphasis>
+sockets, mmap()ed <emphasis>PF_PACKET</emphasis>,
+<emphasis>ipt_ULOG</emphasis>, or <emphasis>ip_queue</emphasis>.   This
+userspace program then analyzes the packets and aggregates the result into
+per-flow data
+structures.
+</para>
+<para>
+Whatever mechanism is used, this scheme has a fundamental performance limitation,
+since all packets need to be copied and analyzed by a userspace process.
+</para>
+<para>
+The author has implemented a different approach, by which the accounting
+information is stored in the in-kernel connection tracking table of the
+ip_conntrack stateful firewall state machine.  On all firewalls, that
+state table has to be kept anyway - the additional overhead introduced by
+accounting is minimal.
+</para>
+</section>
+
+<section>
+<title>Network accounting</title>
+<para>
+Network accounting generally describes the process of counting and potentially
+summarizing metadata of network traffic.  The kind of metadata is largely
+dependent on the particular application, but usually includes data such as numbers of packets, numbers of bytes, and source and destination IP addresses.
+</para>
+<para>
+There are many reasons for doing accounting of networking traffic, among them:
+</para>
+<itemizedlist>
+<listitem><para>transfer volume or bandwidth based billing</para></listitem>
+<listitem><para>monitoring of network utilization, bandwidth distribution and link usage</para></listitem>
+<listitem><para>research, such as distribution of traffic among protocols, average packet size, ...</para></listitem>
+</itemizedlist>
+</section>
+
+<section>
+<title>Existing accounting solutions for Linux</title>
+<para>
+There are a number of existing packages to do network accounting with Linux.
+The following subsections give a short overview of the most
+commonly used ones.
+</para>
+
+<section>
+<title>nacctd</title>
+<para>
+<emphasis>nacctd</emphasis>, also known as <emphasis>net-acct</emphasis>, is probably
+the oldest known tool for network accounting under Linux (it also works on other
+Unix-like operating systems).  The author of this paper has used
+<emphasis>nacctd</emphasis> as an accounting tool as early as 1995.  It was
+originally developed by Ulrich Callmeier, but apparently abandoned later on.
+The development seems to have continued in multiple branches, one of them being
+the <ulink url="http://netacct-mysql.gabrovo.com">netacct-mysql</ulink> branch,
+currently at version 0.79rc2.
+</para>
+<para>
+Its principle of operation is to use an <emphasis>AF_PACKET</emphasis> socket
+via <emphasis>libpcap</emphasis> in order to capture copies of all packets on
+configurable network interfaces.  It then does TCP/IP header parsing on each
+packet.  Summary information such as port numbers, IP addresses and number of
+bytes is then stored in an internal table for aggregation of successive
+packets of the same flow.  The table entries are evicted and stored in a
+human-readable ASCII file.  Patches exist for sending information directly into
+SQL databases, or saving data in machine-readable data format.
+</para>
+<para>
+As a pcap-based solution, it suffers from the performance penalty of copying
+every full packet to userspace.  As a packet-based solution, it suffers from
+the penalty of having to interpret every single packet.
+</para>
+</section>
+
+<section>
+<title>ipt_LOG based</title>
+<para>
+The Linux packet filtering subsystem iptables offers a way to log policy
+violations via the kernel message ring buffer. This mechanism is called
+<emphasis>ipt_LOG</emphasis> (or <emphasis>LOG target</emphasis>).  Such
+messages are then further processed by <emphasis>klogd</emphasis> and
+<emphasis>syslogd</emphasis>, which put them into one or multiple system log
+files.
+</para>
+<para>
+As <emphasis>ipt_LOG</emphasis> was designed for logging policy violations and
+not for accounting, its overhead is significant.   Every packet needs to be
+interpreted in-kernel, then printed in ASCII format to the kernel message ring
+buffer, then copied from klogd to syslogd, and again copied into a text file.
+Even worse, most syslog installations are configured to write kernel log
+messages synchronously to disk, avoiding the usual write buffering of the block
+I/O layer and disk subsystem.
+</para>
+<para>
+To sum up and analyze the data, custom Perl scripts are often used.  Those Perl
+scripts have to parse the LOG lines, build up a table of flows, add the packet
+size fields and finally export the data in the desired format.  Due to the inefficient storage format, performance is again wasted at analysis time.
+</para>
+</section>
+
+<section>
+<title>ipt_ULOG based (ulogd, ulog-acctd)</title>
+<para>
+The iptables <emphasis>ULOG target</emphasis> is a more efficient version of
+the <emphasis>LOG target</emphasis> described above.  Instead of copying ASCII
+messages via the kernel ring buffer, it can be configured to copy only the
+header of each packet, and to send those copies in large batches.   A special
+userspace process, normally ulogd, receives those partial packet copies and
+does further interpretation.  
+</para>
+<para>
+<ulink url="http://gnumonks.org/projects/ulogd">ulogd</ulink> is intended for
+logging of security violations and thus resembles the functionality of LOG.  It
+creates one logfile entry per packet.  It supports logging in many formats,
+such as SQL databases or PCAP format.
+</para>
+<para>
+<ulink
+url="http://alioth.debian.org/projects/pkg-ulog-acctd/">ulog-acctd</ulink> is a
+hybrid between <emphasis>ulogd</emphasis> and <emphasis>nacctd</emphasis>.  It
+replaces the nacctd libpcap/PF_PACKET based capture with the more efficient
+ULOG mechanism.
+</para>
+<para>
+Compared to <emphasis>ipt_LOG</emphasis>, <emphasis>ipt_ULOG</emphasis> reduces
+the amount of copied data and required kernel/userspace context switches and
+thus improves performance.  However, the whole mechanism is still intended for
+logging of security violations.  Accounting is outside of what it was designed for.
+</para>
+</section>
+
+<section>
+<title>iptables based (ipac-ng)</title>
+<para>
+Every packet filtering rule in the Linux packet filter
+(<emphasis>iptables</emphasis>, or even its predecessor
+<emphasis>ipchains</emphasis>) has two counters: number of packets and number
+of bytes matching this particular rule.
+</para>
+<para>
+By carefully placing rules with no target (fallthrough rules) in the
+packet filter ruleset, one can implement an accounting setup, e.g. one rule per
+customer.
+</para>
+<para>
+A number of tools exist to parse the iptables command output and summarize the
+counters.  The most commonly used package is <ulink
+url="http://sourceforge.net/projects/ipac-ng/">ipac-ng</ulink>.  It supports
+advanced features such as storing accounting data in SQL databases.
+</para>
+<para>
+The approach works quite efficiently, but only for small installations (i.e. a small number
+of accounting rules).  Therefore, the accounting granularity can only be very
+coarse.  One counter for each single port number at any given IP address is certainly not feasible.
+</para>
+</section>
+
+<section>
+<title>ipt_ACCOUNT</title>
+<para>
+<ulink url="http://www.intra2net.com/opensource/ipt_account/">ipt_ACCOUNT</ulink>
+is a special-purpose iptables target available from the netfilter project
+patch-o-matic-ng repository.  It requires kernel patching and is not included
+in the mainline kernel.
+</para>
+<para>
+<emphasis>ipt_ACCOUNT</emphasis> keeps byte counters per IP address in a given
+subnet, up to a '/8' network.  Those counters can be read via a special
+"iptaccount" commandline tool.
+</para>
+<para>
+Being limited to local network segments up to '/8' size, and only having per-IP
+granularity are two limitations that defeat <emphasis>ipt_ACCOUNT</emphasis>
+as a generic accounting mechanism.  It is highly optimized, but also
+special-purpose.
+</para>
+</section>
+
+<section>
+<title>ntop (including PF_RING)</title>
+<para>
+<ulink url="http://www.ntop.org/ntop.html">ntop</ulink> is a network traffic
+probe to show network usage.  It uses <emphasis>libpcap</emphasis> to capture
+the packets, and then aggregates flows in userspace.  On a fundamental level it's therefore similar to what <emphasis>nacctd</emphasis> does.
+</para>
+<para>
+From the ntop project, there's also <emphasis>nProbe</emphasis>, a network
+traffic probe that exports flow based information in NetFlow v5/v9 format.
+</para>
+<para>
+To increase performance of the probe, the author (Luca Deri) has implemented
+<ulink url="http://www.ntop.org/PF_RING.html">PF_RING</ulink>, a new zero-copy
+mmap()ed implementation for packet capture.  There is a libpcap compatibility layer on top, so any pcap-using application can benefit from PF_RING.
+</para>
+<para>
+PF_RING is a major performance improvement; for details, please look at the
+documentation and the paper published by Luca Deri.
+</para>
+<para>
+However, ntop / nProbe / PF_RING are all packet-based accounting solutions.
+Every packet needs to be analyzed by some userspace process - even if there is
+no copying involved.  Due to the PF_RING optimization, it is probably as efficient
+as this approach can get.
+</para>
+
+</section>
+
+</section> <!-- existing solutions -->
+
+<section>
+<title>New ip_conntrack based accounting</title>
+<para>
+The fundamental idea is to (ab)use the connection tracking subsystem of the
+Linux 2.4.x / 2.6.x kernel for accounting purposes.  There are several reasons
+why this is a good fit:
+</para>
+<itemizedlist>
+<listitem><para>It already keeps per-connection state information. Extending this information to contain a set of counters is easy.</para></listitem>
+<listitem><para>Lots of routers/firewalls are already running it, and therefore paying its performance penalty for security reasons.  Bumping a couple of counters will introduce very little additional penalty.</para></listitem>
+<listitem><para>There was already an (out-of-tree) system to dump connection tracking information to userspace, called ctnetlink.</para></listitem>
+</itemizedlist>
+<para>
+So given that a particular machine was already running ip_conntrack, adding
+flow based accounting to it comes almost for free.  I do not advocate the use of
+ip_conntrack merely for accounting, since that would be again a waste of
+performance.
+</para>
+
+<section>
+<title>ip_conntrack_acct</title>
+<para>
+<emphasis>ip_conntrack_acct</emphasis> is the name of the in-kernel
+<emphasis>ip_conntrack</emphasis> counters.  There is a set of four
+counters: numbers of packets and bytes for original and reply
+direction of a given connection.
+</para>
+<para>
+If you configure a recent (>= 2.6.9) kernel, it will prompt you for
+<emphasis>CONFIG_IP_NF_CT_ACCT</emphasis>.  By enabling this configuration
+option, the per-connection counters will be added, and the accounting code will
+be compiled in.
+</para>
+<para>
+However, there is still no efficient means of reading out those counters.  They
+can be accessed via "cat /proc/net/ip_conntrack", but that's not a real
+solution.  The kernel iterates over all connections and ASCII-formats the data.
+Also, it is a polling-based mechanism.   If the polling interval is too long,
+connections might get evicted from the state table before their final counters
+are read.  If the interval is too short, performance will suffer.
+</para>
+<para>
+To counter this problem, a combination of conntrack notifiers and ctnetlink is used.
+</para>
+</section>
+
+<section>
+<title>conntrack notifiers</title>
+<para>
+Conntrack notifiers use the core kernel notifier infrastructure
+(<emphasis>struct notifier_block</emphasis>) to notify other parts of the
+kernel about connection tracking events.  Such events include creation,
+deletion and modification of connection tracking entries.
+</para>
+<para>
+The conntrack notifiers can help us overcome the polling architecture.  If we only listened to "conntrack delete" events, we would always get the byte and packet counters at the end of a connection.
+</para>
+<para>
+However, the events are in-kernel events and therefore not directly suitable
+for an accounting application to be run in userspace.
+</para>
+</section>
+
+<section>
+<title>ctnetlink</title>
+<para>
+<emphasis>ctnetlink</emphasis> (short form for conntrack netlink) is a
+mechanism for passing connection tracking state information between kernel and
+userspace, originally developed by Jay Schulist and Harald Welte.   As the name
+implies, it uses Linux <emphasis>AF_NETLINK</emphasis> sockets as its
+underlying communication facility.
+</para>
+<para>
+The focus of <emphasis>ctnetlink</emphasis> is to selectively read or dump
+entries from the connection tracking table to userspace.  It also allows
+userspace processes to delete and create conntrack entries as well as
+"conntrack expectations".
+</para>
+<para>
+The initial nature of <emphasis>ctnetlink</emphasis> is therefore again
+polling-based.  A userspace process sends a request for certain information, and
+the kernel responds with the requested information.  </para>
+<para>
+By combining <emphasis>conntrack notifiers</emphasis> with
+<emphasis>ctnetlink</emphasis>, it is possible to register a notifier handler
+that in turn sends <emphasis>ctnetlink</emphasis> event messages down the <emphasis>AF_NETLINK</emphasis> socket.
+</para>
+<para>
+A userspace process can now listen for such DELETE event messages at the
+socket, and put the counters into its accounting storage.
+</para>
+<para>
+There are still some shortcomings inherent to that DELETE event scheme:  We
+only know the amount of traffic after the connection is over.  If a connection
+lasts for a long time (let's say days, weeks), then it is impossible to use
+this form of accounting for any kind of quota-based billing, where the user
+would be informed (or disconnected, traffic shaped, whatever) when he exceeds
+his quota.   Also, the conntrack entry does not contain information about when the connection started - only the timestamp of the end-of-connection is known.
+</para>
+<para>
+To overcome the first limitation, the accounting process can use a combined
+event and polling scheme.  The granularity of accounting can therefore be
+configured by the polling interval, and a compromise between performance and
+accuracy can be made.
+</para>
+<para>
+To overcome the second limitation, the accounting process can also listen for
+NEW event messages.  By correlating the NEW and DELETE messages of a
+connection, accounting datasets containing start and end of connection can be built.
+</para>
+</section>
+
+<section>
+<title>ulogd2</title>
+<para>
+As described earlier in this paper, <emphasis>ulogd</emphasis> is a userspace
+packet filter logging daemon that is already used for packet-based accounting,
+even if it isn't the best fit.
+</para>
+<para>
+<emphasis>ulogd2</emphasis>, also developed by the author of this paper, takes
+logging beyond per-packet information and also includes support for
+per-connection or per-flow based data.
+</para>
+<para>
+Instead of supporting only <emphasis>ipt_ULOG</emphasis> input and a number of
+interpreter and output plugins, <emphasis>ulogd2</emphasis> supports a concept
+called plugin stacks.  Multiple stacks can exist within one daemon.  Any such
+stack consists of plugins.  A plugin can be a source, sink or filter.
+</para>
+<para>
+Sources acquire per-packet or per-connection data from <emphasis>ipt_ULOG</emphasis> or <emphasis>ip_conntrack_acct</emphasis>.
+</para>
+<para>
+Filters allow the user to filter or aggregate information.   Filtering is
+required, since there is no way to filter the ctnetlink event messages within
+the kernel.  Either the functionality is enabled or not.  Multiple connections
+can be aggregated to a larger, encompassing flow.  Packets could be aggregated
+to flows (like <emphasis>nacctd</emphasis>), and flows can be aggregated to
+even larger flows.
+</para>
+<para>
+Sink plugins store the resulting data to some form of non-volatile storage,
+such as SQL databases, binary or ASCII files.   Another sink is a NetFlow or
+IPFIX sink, exporting information in industry-standard format for flow based accounting.
+</para>
+</section>
+
+<section>
+<title>Status of implementation</title>
+<para>
+<emphasis>ip_conntrack_acct</emphasis> has been in the kernel since 2.6.9.
+</para>
+<para>
+<emphasis>ctnetlink</emphasis> and the <emphasis>conntrack event
+notifiers</emphasis> are considered stable and will be submitted for mainline
+inclusion soon.  Both are available from the patch-o-matic-ng repository of the
+netfilter project.
+</para>
+<para>
+At the time of writing of this paper, <emphasis>ulogd2</emphasis> development
+was not yet finished.  However, the ctnetlink event messages can already be
+dumped by the use of the "conntrack" userspace program, available from the
+netfilter project.
+</para>
+<para>
+The "conntrack" program can listen to the netlink event socket and dump the
+information in human-readable form (one ASCII line per ctnetlink message) to
+stdout.  Custom accounting solutions can read this information from stdin,
+parse and process it according to their needs.
+</para>
+</section>
+
+</section> <!-- new ip_conntrack based -->
+
+<section>
+<title>Summary</title>
+<para>
+Despite the large number of available accounting tools, the author is confident that inventing yet another one is worthwhile.
+</para>
+<para>
+Many existing implementations suffer from performance issues by design.  Most
+of them are very special-purpose.  nProbe/ntop together with PF_RING are
+probably the most universal and efficient solution for any accounting problem.
+</para>
+<para>
+Still, the new <emphasis>ip_conntrack_acct, ctnetlink</emphasis> based
+mechanism described in this paper has a clear performance advantage if you want
+to do accounting on your Linux-based stateful packetfilter - which is a common
+case.  The firewall is supposed to be at the edge of your network, exactly where
+you usually do accounting of ingress and/or egress traffic.
+</para>
+</section>
+
+</article>
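
The NEW/DELETE correlation described in the ctnetlink section of the paper can be sketched roughly as follows. This is only an illustration in Python, not the author's ulogd2 code. The event line format (`[NEW]` or `[DELETE]` followed by `key=value` fields) and the field names `src`, `dst`, `sport`, `dport`, `packets` and `bytes` are assumptions made for this sketch, not the documented output of the "conntrack" program. The sketch also keeps a single packet/byte pair per connection, whereas ip_conntrack_acct keeps separate counters for the original and reply direction. Timestamps are supplied by the caller, because the conntrack entry itself does not carry the start time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlowRecord:
    key: tuple              # (src, dst, sport, dport)
    start: Optional[float]  # None if the NEW event was never seen
    end: float
    packets: int
    bytes: int


def parse_event(line):
    """Split an assumed '[TYPE] key=value ...' line into (type, fields)."""
    tag, _, rest = line.strip().partition("] ")
    fields = dict(tok.split("=", 1) for tok in rest.split() if "=" in tok)
    return tag.lstrip("["), fields


class Correlator:
    """Pairs the NEW and DELETE events of the same connection."""

    def __init__(self):
        self.open_flows = {}  # flow key -> time the NEW event arrived
        self.records = []     # finished FlowRecord entries

    def feed(self, line, now):
        event_type, fields = parse_event(line)
        key = tuple(fields[name] for name in ("src", "dst", "sport", "dport"))
        if event_type == "NEW":
            self.open_flows[key] = now
        elif event_type == "DELETE":
            # The final counters arrive with the DELETE event; the start
            # time comes from the NEW event, if we were listening by then.
            start = self.open_flows.pop(key, None)
            self.records.append(FlowRecord(
                key, start, now,
                int(fields.get("packets", 0)),
                int(fields.get("bytes", 0))))
```

A caller would pass each event line to `feed()` together with the current time (for example `time.time()`), and then write the entries in `records` to whatever accounting storage is in use. Connections that were already established before the listener started produce records with `start` set to `None`.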