Advanced Linux Networking with iproute2 and tc
Harald Welte
laforge@gnumonks.org
2000 Harald Welte INSERT GNU FDL HERE
Introduction

As the Linux kernel develops further and further, the network stack remains one of the areas with the biggest changes and improvements. Starting with kernel 2.2, Alexey Kuznetsov introduced a whole new IPv4 routing subsystem (iproute2) as well as a traffic shaping subsystem (tc). Starting with kernel 2.4.x, we also have a truly multithreaded network stack, and of course the more-than-flexible netfilter and iptables subsystems.

While most people know about the presence of these subsystems, knowledge about their usage and the vast range of possible applications is still rare. One major problem is that hardly anybody who hasn't read the source code or spent weeks and months playing around with these features is able to understand them. Mostly the lack of documentation is to blame for this situation. This document's main intention is to accompany my talk/presentation at the CCC Congress 2000, but I think it is worth reading independently as well.

Overview

What can I do using all this stuff? First I'll give a short overview of the possible applications of iproute2 and tc.

Have routing decisions based on other things than the destination address
Traditional IP routing bases the routing decision only on the destination IP address. While this is sufficient for most cases, modern networking scenarios may call for more sophisticated routing. Using iproute2, you may base the routing decision for each packet separately on various properties like the owner of the sending socket, port numbers, type of service, ...

Help you share bandwidth according to your needs
In real-world scenarios you always have a limited bandwidth. As soon as this bandwidth is used by more and more users and/or services, you might want to control how much of your uplink's bandwidth is available for which service.

Prevent certain DoS attacks
There are certain kinds of DoS attacks which can be prevented through clever iproute2/tc usage. I'm especially referring to various flooding attacks.

Advanced Routing with iproute2

Traditional IP Routing

Before we dive into the iproute2-specific stuff, I'll give a short overview of how traditional IP routing works. Every host inside an IP network which is connected to more than one physical network segment is called a router or gateway. Each of its interfaces has a particular IP address and netmask configured, so the router knows which hosts can be reached in which physical segment. To keep track of this information, it has a routing table.

In addition to the information about which networks / hosts can be reached directly, it is possible to manually insert additional entries into this routing table. In most cases we have at least one default route entry, which specifies where to send all packets whose destination lies outside of the locally attached network segments. More advanced routers use dynamic routing protocols like RIP, OSPF, ... to automatically adapt the routing table entries to network failures.

Independent of how entries get into this routing table - sometimes also referred to as the RIB (Routing Information Base) - the decision about where to send the packet on the physical layer is always based on the destination IP address. At first glance this seems quite obvious and correct: you want to get your packet to its destination, so why care about where the packet came from, or any other information? But it isn't that easy anymore.
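For reference, a purely destination-based setup on such a gateway could look like the following minimal sketch (it uses the ip command from the iproute2 package, which is introduced below; all addresses and interface names are invented for illustration). Every packet towards the same destination takes the same path, regardless of source, port or service:

    # directly attached segments
    ip route add 192.168.1.0/24 dev eth0
    ip route add 10.0.0.0/24 dev eth1
    # everything else goes to the upstream gateway
    ip route add default via 10.0.0.254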
Nowadays people want to have things like pre-allocated or guaranteed bandwidth, or want to route packets depending on which service they belong to (i.e. route web traffic over a different line than mail traffic). This is where iproute2 comes in: it is Linux's answer to this demand.

iproute2 overview

iproute2 is the 'new' IP network stack, as introduced in Linux 2.2.x by our Linux networking god Alexey Kuznetsov. Apart from a lot of other architectural changes, which mostly aim at increased performance, it also provides a routing engine capable of basing routing decisions on almost anything you want (of course including the default case: routing decisions based on the destination IP address). To make things more complicated, iproute2 has two meanings:

The IP network stack
The command suite used to configure it

Policy Routing

So what architecture did Alexey and the other guys invent to provide the advanced routing features while keeping a backwards-compatible default behaviour? Instead of having one routing table for all packets, iproute2 enables us to have multiple routing tables. So how do we decide which routing table to use for a particular packet? We decide based on information in the routing policy database. If we want to decide upon a packet's new destination (in other words: make a routing decision for this packet), we first look into the routing policy database, which tells us which routing table to use.

The routing policy database consists of a list of rules. Each rule consists of three parts:

priority
A priority, which tells us in which order the routing policy database is traversed.

match
A match, telling us which packets this rule applies to. The following matches are available: packet source address, packet destination address, TOS value, incoming interface, and fwmark (firewall mark, set by ipchains / iptables). The most flexible (and therefore most commonly used) match is the fwmark match. Firewalling (to be more precise: packet filtering based on ipchains or iptables) already has very sophisticated means for matching packets. You can easily select packets based on their TCP flags, TCP/UDP port numbers, and even on the state of the connection they belong to. The interaction between firewalling rules and policy routing works like this: iptables/ipchains rules assign the packet an fwmark according to the packet filtering rules (you can specify an arbitrary 32-bit number as fwmark for each rule). When the packet is to be routed and policy routing has to make a decision, it looks for a policy routing rule with the same fwmark the packet carries, and performs the action associated with this rule (usually a lookup in a specific routing table). A combined example follows the description of the ip command below.

action
Which action to perform if a packet matches this rule. Usually the action points us to one of the routing tables, but we can also decide to drop the packet or to return an ICMP error message to the sender.

In order to use this routing policy database, you have to enable the compile-time kernel option "IP: policy routing" (CONFIG_IP_MULTIPLE_TABLES).

The iproute2 command

To configure the new Linux IP stack, we use the ip command from the iproute2 package. We can configure things like interface addresses, neighbour/arp tables, policy routing, routing table entries, tunnels, multicast routing, and a lot of other network-related stuff using this tool.
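As a hedged sketch of the fwmark interaction described above (the mark value 1, table number 100, interface eth2 and gateway address are all invented for illustration), the following commands route forwarded web traffic over a second uplink via its own routing table:

    # mark forwarded web traffic with an arbitrary fwmark
    iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1
    # a separate routing table with its own default route over the second uplink
    ip route add default via 192.168.2.254 dev eth2 table 100
    # policy rule: packets carrying fwmark 1 are looked up in table 100
    ip rule add fwmark 1 table 100 priority 1000
    # flush the routing cache so the change takes effect immediately
    ip route flush cache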
iproute2 communicates with the kernel over a sophisticated kernel-userspace interface called netlink sockets, which is also used by other recent network-related code such as netfilter's userspace queueing and packet logging framework.

ip rule

The iproute2 rule management (like most other iproute2-manageable information) allows three basic operations:

show
Not surprisingly, this command shows us the current policy routing rules. It doesn't take any additional arguments.

add
We can add a new entry to the list of policy routing rules. Valid parameters are:
type - type of this rule
from - source address and mask
to - destination address and mask
iif - incoming interface name
tos - TOS value
fwmark - firewall mark field, set by ipchains/iptables

delete
Delete an entry from the list of policy routing rules.

Bandwidth Management

Apart from having more flexible routing decisions, there are other demands on modern routers. Imagine an ISP which wants to pre-allocate a specific share of its uplink bandwidth to a particular customer. Or, even if you don't want hard bandwidth limits, you may want to give specific traffic a higher priority than other traffic. The major buzzwords are QoS, packet scheduling and DiffServ.

How to do bandwidth management

The best way to influence which kind of packets get which part of the total available bandwidth is to influence how packets are enqueued at an intermediate router between a high-bandwidth and a low-bandwidth interface. More packets arrive on the high-bandwidth link than can be sent out on the other side, the low-bandwidth link. The router has to enqueue the packets which are to be sent on the low-bandwidth interface, and once the queue is full, it has to drop packets. Although there are several ways to influence this queue, in the end it's nothing more than deciding which packets are enqueued at which position inside the queue. Please note that you can only ever influence the sending path.

TC: Linux Traffic Control

The traffic control code in the Linux kernel consists of the following major conceptual components:

queuing disciplines
classes (within a queuing discipline)
filters
policing

After the network stack inside the Linux kernel has made its routing decision, it knows on which network device the packet has to be sent out. Each network device has some information attached to its device structure about how to enqueue packets for this particular interface. This queuing information is what the Linux developers call a queuing discipline. A very simple queuing discipline may just consist of a single queue, where all packets are stored in the order in which they have been enqueued, and which is emptied as fast as the respective network device can send. More elaborate queuing disciplines may use filters to distinguish among different classes of packets and process each class in a specific way, e.g. by giving one class priority over other classes.

Queuing disciplines and classes are intimately tied together: the presence of classes and their semantics are fundamental properties of the queuing discipline. In contrast, filters can be combined arbitrarily with queuing disciplines and classes, as long as the queuing discipline provides classes at all. To further increase flexibility, each class can use another queuing discipline for enqueuing its packets. This queuing discipline can, in turn, again have multiple classes, each with its own queuing discipline attached, and so on.

All items inside tc are identified by a handle. A handle consists of a major and a minor number, separated by a colon (for example 10:0).
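To illustrate how handles tie qdiscs, classes and filters together, here is a rough sketch of such a hierarchy built with the tc command (device name, rates and the CBQ parameters are invented placeholders and would need tuning for a real setup):

    # root queuing discipline on eth0, handle 1:
    tc qdisc add dev eth0 root handle 1: cbq bandwidth 10Mbit avpkt 1000 cell 8
    # a class 1:1 inside qdisc 1:, limited to 1Mbit
    tc class add dev eth0 parent 1: classid 1:1 cbq bandwidth 10Mbit rate 1Mbit \
        weight 100Kbit prio 5 allot 1514 cell 8 maxburst 20 avpkt 1000 bounded
    # a child queuing discipline with handle 10:, attached to class 1:1
    tc qdisc add dev eth0 parent 1:1 handle 10: sfq perturb 10
    # a filter on qdisc 1: directing packets carrying fwmark 1 into class 1:1
    tc filter add dev eth0 parent 1: protocol ip prio 1 handle 1 fw flowid 1:1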
Available queuing disciplines

This chapter lists the currently available queuing disciplines and gives a short description of their functionality.

Token Bucket Filter (TBF)

The Token Bucket Filter (TBF) is a simple queue that only passes packets arriving at a rate within an administratively set bound, with the possibility to buffer short bursts. The TBF implementation consists of a buffer (bucket), constantly filled with virtual pieces of information (called tokens) at a specific rate (the token rate). The most important parameter of the bucket is its size, that is, the number of tokens it can store. Each arriving token lets one data packet out of the queue and is then deleted from the bucket. Associating this algorithm with the two flows - tokens and data - gives us three possible scenarios:

Data arrives into the TBF at a rate equal to the rate of incoming tokens. In this case each packet has its matching token and passes the queue without further delay.

Data arrives into the TBF at a rate smaller than the token rate. Only some tokens are deleted from the bucket - one as each packet leaves - so tokens accumulate in the bucket, up to the bucket size. The saved tokens can then be used to send data at a rate higher than the token rate, to compensate for small bursts.

Data arrives at a rate higher than the token rate. In this case a filter overrun occurs - incoming data can only be sent out without loss until all accumulated tokens are used up. After that, overlimit packets are dropped.

Class Based Queue (CBQ)

This queuing discipline classifies the waiting packets into a tree-like hierarchy of classes. The leaves of this tree are in turn scheduled by separate queuing disciplines. CBQ is a very commonly used scheduler; it is often used as the root queuing discipline onto which other queuing disciplines are attached.

Stochastic Fairness Queuing (SFQ)

SFQ is not quite deterministic, but works (on average). Its main benefits are that it requires little CPU and memory. SFQ consists of a dynamically allocated number of FIFO queues, one for each conversation. A conversation (or flow) is distinguished by its source/destination IP addresses and port numbers. The discipline runs in round-robin, sending one packet from each FIFO per turn, which is why it is called fair. The main advantage of SFQ is that it allows fair sharing of the link between different applications and prevents bandwidth takeover by a single client or application.

pfifo_fast

This queue is, as the name says, first in, first out, which means that no packet receives any special treatment. At least, not quite: this qdisc has three so-called 'bands'. Within each band, FIFO rules apply. However, as long as there are packets waiting in band 0, band 1 won't be processed. The same goes for band 1 and band 2.

Random Early Detection (RED)

RED is mainly useful for TCP traffic, because it relies on TCP's flow control (slow start). Once the link is filling up, it starts dropping packets. This signals to the TCP stack on the sending machine that the link is congested, and the sender slows down. The trick is that it simulates real congestion before the queue actually overflows.

Ingress policer

The ingress policer implements a hard limit. You configure it with a specific rate, and all packets entering this queue in excess of the configured rate are dropped.
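As a rough sketch of how these queuing disciplines are attached in practice (device names, rates and burst sizes are invented and would need tuning for any real link):

    # hard-limit outgoing traffic on ppp0 to roughly 220kbit with a Token Bucket Filter
    tc qdisc add dev ppp0 root tbf rate 220kbit latency 50ms burst 1540
    # or: share the outgoing link on eth0 fairly between flows with SFQ
    tc qdisc add dev eth0 root sfq perturb 10
    # ingress policer: drop incoming traffic on eth0 exceeding 1mbit
    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol ip prio 50 u32 \
        match ip src 0.0.0.0/0 police rate 1mbit burst 10k drop flowid :1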
Further Reading

Acknowledgements

Although I wrote this document, I wasn't involved in any of the iproute2 / tc development. I am still baffled by the abstract, flexible concept it provides. My thanks go out to the iproute2+tc developers, especially Alexey Kuznetsov (our Linux networking god) and Werner Almesberger. Thanks to Rusty Russell, who inspired me at OLS2000 and LBW2000 to get more deeply involved with netfilter. I want to thank Andi Kleen and Marc Boucher for some really nice discussions at our meetings in Munich. Not to forget Bert Hubert and his team for writing the Linux 2.4 Advanced Routing HOWTO. Additional special thanks to the people who invented DocBook.

Glossary

Differentiated Services (DiffServ)
DiffServ is one of the two actual QoS implementations (the other one is called Integrated Services); it is based on a value carried by packets in the DS field of the IP header.

ipchains
The packet filtering system in Linux 2.2.

iptables
The packet filtering system in Linux 2.4, based on netfilter.

netfilter
Common term for the Linux 2.4 firewalling subsystem. To be more precise, it is the infrastructure underlying packet filtering, NAT and packet mangling.

Netlink socket
A special socket between kernel and userspace. Used by iproute2 to alter information in the routing tables, arp cache, policy routing database, ...

Open Shortest Path First (OSPF)
A dynamic routing protocol.

Quality of Service (QoS)
Guaranteeing a certain bandwidth for specific applications.

Routing Information Protocol (RIP)
A dynamic routing protocol.