<!doctype book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[]>
<book id="iproute2+tc-presentation">
<bookinfo>
<title>Advanced Linux Networking with iproute2 and tc</title>
<authorgroup>
<author>
<firstname>Harald</firstname>
<surname>Welte</surname>
<affiliation>
<address>
<email>laforge@gnumonks.org</email>
</address>
</affiliation>
</author>
</authorgroup>
<copyright>
<year>2000</year>
<holder>Harald Welte</holder>
</copyright>
<legalnotice>
<para>
INSERT GNU FDL HERE
</para>
</legalnotice>
</bookinfo>
<toc></toc>
<chapter id="intro">
<title>Introduction</title>
<para>
As the Linux kernel develops further and further, the network stack is one of the areas with the biggest changes and improvements of all. Starting with kernel 2.2, Alexey Kuznetsov introduced a whole new IPv4 routing subsystem (iproute2) as well as a traffic shaping subsystem (tc). Starting with kernel 2.4.x, we also have a truly multithreaded network stack, and of course the more-than-flexible netfilter and iptables subsystems.
</para>
<para>
While most people know about the existence of these subsystems, knowledge about their usage and the vast range of possible applications is scarce. One major problem is that almost nobody who hasn't read the source code or spent weeks and months playing around with these features is able to understand them. Mostly the lack of documentation is to blame for this situation.
</para>
<para>
This document's main intention is to accompany my talk/presentation at the CCC Congress 2000, but I think it is still worth reading independently.
</para>
</chapter>
<chapter id="overview">
<title>Overview</title>
<sect1 id="over-what">
<title>What can I do using all this stuff?</title>
<para>
First I'll give a short overview about the possible applications of iproute2 and tc.
</para>
<variablelist>
<varlistentry>
<term>Have routing decisions based on other things than destination address</term>
<listitem>
<para>
Traditional IP routing bases the routing decision only on the destination IP address. While this is sufficient for most cases, modern networking scenarios may call for more sophisticated routing. Using iproute2, you may base the routing decision for each packet separately, depending on various properties like the owner of the sending socket, port numbers, type of service, ...
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Help you sharing bandwidth according to your needs</term>
<listitem>
<para>
In real-world scenarios you always have a limited bandwidth. As soon as this bandwidth is used by more and more users and/or services, you might want to control how much of your uplink's bandwidth is available for which service.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Prevent certain DoS attacks</term>
<listitem>
<para>
There are certain kinds of DoS attacks which can be prevented through clever iproute2/tc usage. I'm especially referring to various flooding attacks.
</para>
</listitem>
</varlistentry>
</variablelist>
</sect1>
</chapter>
<chapter id="iproute2">
<title>Advanced Routing with iproute2</title>
<sect1 id="iproute2-traditional">
<title>Traditional IP Routing</title>
<para>
Before we dive into the iproute2-specific details, I'll give a short overview of how traditional IP routing works.
</para>
<para>
Every host inside the IP network which is connected to more than one physical network segment is called a <indexterm id="router"><primary>router</primary><secondary>gateway</secondary></indexterm>router or gateway. Each of its interfaces has a particular IP address and netmask configured, so the router knows which hosts can be reached in which physical segment. To keep track of this information, it has a routing table. In addition to the information about which networks / hosts can be reached directly, it is possible to manually insert additional entries into this routing table. In most cases we have at least one default route entry, which specifies where to send all packets whose destination is outside of the locally attached network segments. More advanced routers use dynamic routing protocols like <glossterm linkend="gloss-rip">RIP</glossterm>, <glossterm linkend="gloss-ospf">OSPF</glossterm>, ... to automatically adapt the routing table entries to network failures.
</para>
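<para>
On a running system, this traditional routing table can be inspected and modified with the ip command (the addresses and device below are illustrative examples, not a recommendation):
</para>
<screen>
# show the main routing table
ip route show
# add a route to a remote network via a gateway, and a default route
ip route add 10.0.0.0/8 via 192.168.1.254
ip route add default via 192.168.1.1
</screen>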
<para>
Independent of how entries get into this routing table - sometimes also referred to as the <indexterm id="rib"><primary>Routing Information Base</primary><secondary>RIB</secondary></indexterm>RIB (routing information base) - the decision about where to send the packet on the physical layer is always based on the destination IP address.
</para>
<para>
At first glance this seems quite obvious and correct - you want to get your packet to the destination, so why care about where the packet came from, or any other information? But it isn't that easy anymore. Nowadays people want features like pre-allocated or guaranteed bandwidth, or want to route packets depending on which service they belong to (i.e. route web traffic over a different line than mail traffic).
</para>
<para>
This is where iproute2 comes in: It is Linux's answer to this demand.
</para>
</sect1>
<sect1 id="iproute2-overview">
<title>iproute2 overview</title>
<para>
iproute2 is the 'new' IP network stack, as introduced in Linux 2.2.x by our Linux networking god Alexey Kuznetsov. Apart from a lot of other architectural changes, which mostly aim at increased performance, it also provides a routing engine capable of basing routing decisions on almost anything you want (of course including the default case: a routing decision based on the destination IP address).
</para>
<para>
To make things more complicated, iproute2 has two meanings:
<itemizedlist>
<listitem><para>The IP network stack</para></listitem>
<listitem><para>The command to configure it</para></listitem>
</itemizedlist>
</para>
</sect1>
<sect1 id="iproute2-rules">
<title>Policy Routing</title>
<para>
So what architecture did Alexey and the other developers invent to provide the advanced routing features while keeping a backwards-compatible default behaviour?
</para>
<para>
Instead of having one routing table for all packets, iproute2 enables us to have multiple routing tables. So how do we decide which routing table to use for a particular packet? We decide based on the information present in the routing policy database.
</para>
<para>
If we want to decide upon a packet's new destination (in other words: make a routing decision for this packet), we first look into the routing policy database, which tells us which routing table to use.
</para>
<para>
The routing policy database consists of a list of rules. Each rule consists of three parts:
</para>
<variablelist>
<varlistentry>
<term>priority</term>
<listitem>
<para>
A priority, which tells us in which order to traverse the routing policy database.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>match</term>
<listitem>
<para>
A match, telling us which packets match this rule. We have the following matches available:
<itemizedlist>
<listitem><para>packet source address</para></listitem>
<listitem><para>packet destination address</para></listitem>
<listitem><para>TOS value</para></listitem>
<listitem><para>Incoming interface</para></listitem>
<listitem><para>fwmark (firewall mark, set by ipchains / iptables)</para></listitem>
</itemizedlist>
</para>
<para>
The most flexible (and therefore most commonly used) match is the <indexterm id="fwmark"><primary>fwmark</primary></indexterm>fwmark match. Firewalling (to be more precise: packet filtering based on <glossterm linkend="gloss-ipchains">ipchains</glossterm> or <glossterm linkend="gloss-iptables">iptables</glossterm>) already has very sophisticated means for matching packets. You can easily select packets based on their TCP flags, TCP/UDP port numbers, and even on the state of the connection they belong to. Interaction between firewalling rules and policy routing works like this:
</para>
<para>
iptables/ipchains rules assign the packet a fwmark according to the packet filtering rules (you can specify arbitrary 32-bit numbers as the fwmark for each rule). When the packet is to be routed and policy routing has to make a decision, it looks for a policy routing rule with the same fwmark the packet has, and performs the appropriate action connected with this rule (usually looking up a specific routing table).
</para>
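<para>
As a sketch of this interaction (the mark value 1, table number 200 and gateway address are arbitrary examples), marking forwarded web traffic and routing it over a separate line could look like this:
</para>
<screen>
# mark all forwarded web traffic with fwmark 1 (Linux 2.4 / iptables)
iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1
# send marked packets through routing table 200
ip rule add fwmark 1 table 200
ip route add default via 10.0.3.1 table 200
</screen>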
</listitem>
</varlistentry>
<varlistentry>
<term>action</term>
<listitem>
<para>
The action to perform if a packet matches this rule. Usually the action points us to one of the routing tables, but we can also decide to drop the packet or to return an ICMP error message to the sender.
</para>
</listitem>
</varlistentry>
</variablelist>
<para>
In order to use this routing policy database, you have to enable the compile-time kernel option "IP: policy routing" (CONFIG_IP_MULTIPLE_TABLES).
</para>
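<para>
A minimal policy routing setup might route all traffic from one subnet over a second uplink (the table number 100, the priority and the addresses are arbitrary examples):
</para>
<screen>
# packets from 10.0.1.0/24 are looked up in table 100
ip rule add from 10.0.1.0/24 table 100 priority 1000
# table 100 sends everything to the second uplink
ip route add default via 10.0.2.254 table 100
</screen>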
</sect1>
<sect1 id="iproute2-command">
<title>The iproute2 command</title>
<para>To configure the new Linux IP stack, we use the ip command from the iproute2 package. We can configure things like interface addresses, neighbour/ARP tables, policy routing, routing table entries, tunnels, multicast routing, and a lot of other network-related stuff using this tool.
</para>
<para>
iproute2 communicates over a sophisticated kernel-userspace interface called <glossterm linkend="gloss-netlink">netlink sockets</glossterm>, which is also used by other recent network-related subsystems like netfilter's userspace queueing and packet logging framework.
</para>
<sect2 id="iproute2-command-rule">
<title>ip rule</title>
<para>
The iproute2 rule management (like most other iproute2-manageable information) allows three basic operations:
</para>
<variablelist>
<varlistentry>
<term>show</term>
<listitem>
<para>
As the name implies, this command shows us the current policy routing rules. It doesn't take any additional arguments.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>add</term>
<listitem>
<para>
We can add a new entry to the list of policy routing rules. Valid parameters are:
</para>
<itemizedlist>
<listitem>
<para>type</para>
<para>type of this rule</para>
</listitem>
<listitem>
<para>from</para>
<para> source address and mask </para>
</listitem>
<listitem>
<para>to</para>
<para> destination address and mask </para>
</listitem>
<listitem>
<para>iif</para>
<para>incoming interface name</para>
</listitem>
<listitem>
<para>tos</para>
<para>TOS value</para>
</listitem>
<listitem>
<para>fwmark</para>
<para>firewall mark field, set by ipchains/iptables</para>
</listitem>
</itemizedlist>
</listitem>
</varlistentry>
<varlistentry>
<term>delete</term>
<listitem>
<para>
Deletes an existing rule. The rule to be deleted is identified by the same selectors that were used to add it.
</para>
</listitem>
</varlistentry>
</variablelist>
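<para>
A short session illustrating the three operations (the subnet and table number are example values):
</para>
<screen>
ip rule show
ip rule add from 192.168.7.0/24 table 10
ip rule delete from 192.168.7.0/24 table 10
</screen>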
</sect2>
</sect1>
</chapter>
<chapter id="tc">
<title>Bandwidth Management</title>
<para>
Apart from having more flexible routing decisions, there are other demands on modern routers. Imagine an ISP which wants to pre-allocate a specific share of its uplink bandwidth to a particular customer. Or, even if you don't want hard bandwidth limits, you may want to give specific traffic a higher priority than other traffic. The major buzzwords are <glossterm linkend="gloss-qos">QoS</glossterm>, packet scheduling and <glossterm linkend="gloss-diffserv">DiffServ</glossterm>.
</para>
<sect1 id="tc-basics">
<title>How to do bandwidth management</title>
<para>
The best way to influence which kind of packets get which part of the total available bandwidth is to influence how packets are enqueued at an intermediate router between a high-bandwidth and a low-bandwidth interface. More packets arrive on the high-bandwidth link than can be sent out on the other side, the low-bandwidth link. The router has to enqueue the packets which are to be sent on the low-bandwidth interface. Once the queue is full, the router has to drop packets.
</para>
<para>
Although there are several ways to influence this queue, in the end it's nothing more than deciding which packets are enqueued at which position inside the queue.
</para>
<para>
Please note that you can only ever influence the sending path.
</para>
</sect1>
<sect1 id="tc-linux">
<title>TC: Linux Traffic Control</title>
<para>
The traffic control code in the Linux kernel consists of the following major conceptual components:
<itemizedlist>
<listitem><para>queuing disciplines</para></listitem>
<listitem><para>classes (within a queuing discipline)</para></listitem>
<listitem><para>filters</para></listitem>
<listitem><para>policing</para></listitem>
</itemizedlist>
</para>
<para>
After the network stack inside the Linux kernel has made its routing decision, it knows on which network device the packet has to be sent out. Each network device has some information about how to enqueue the packets for this particular interface attached to its device structure. This queuing information is what the Linux developers called <indexterm id="qdisc"><primary>queuing discipline</primary></indexterm>queuing discipline.
</para>
<para>
A very simple queuing discipline may just consist of a single queue, where all packets are stored in the order in which they have been enqueued, and which is emptied as fast as the respective network device can send.
</para>
<para>
More elaborate queuing disciplines may use filters to distinguish among different classes of packets and process each class in a specific way, e.g. by giving one class priority over other classes.
<mediaobject>
<imageobject>
<imagedata fileref="qdisc_basic.gif" format="gif" width="100" scalefit="1">
</imageobject>
</mediaobject>
</para>
<para>
Queuing disciplines and classes are intimately tied together: the presence of classes and their semantics are fundamental properties of the queuing discipline. In contrast, filters can be combined arbitrarily with queuing disciplines and classes, as long as the queuing discipline provides classes at all. To further increase flexibility, each class can use another queuing discipline for enqueuing its packets. This queuing discipline can, in turn, again have multiple classes which each have their own queuing discipline attached, and so on.
<inlinemediaobject>
<imageobject>
<imagedata fileref="qdisc_sophisticated.png" format="png">
</imageobject>
</inlinemediaobject>
</para>
<para>
All items inside TC are identified by a handle. A handle consists of a major and a minor number, separated by a colon (e.g. 10:0).
</para>
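<para>
As an illustration of handles (the device name and handle numbers are chosen arbitrarily), the following attaches a prio queuing discipline as root with handle 1:, and an SFQ with handle 10: to its first class 1:1:
</para>
<screen>
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:1 handle 10: sfq
</screen>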
</sect1>
<sect1>
<title>Available queuing disciplines</title>
<para>
This section lists the currently available queuing disciplines and gives a short description of their functionality.
</para>
<sect2>
<title>Token Bucket Filter (TBF)</title>
<para>
The Token Bucket Filter (TBF) is a simple queue that only passes packets arriving at a rate within the bounds of an administratively set rate, with the possibility to buffer short bursts.
</para>
<para>
The TBF implementation consists of a buffer (bucket), constantly filled by some virtual pieces of information (called tokens) at a specific rate (called token rate). The most important parameter of the bucket is its size, that is the number of tokens it can store.
</para>
<para>
Each arriving token lets one data packet out of the queue and is then deleted from the bucket. Associating this algorithm with the two flows - tokens and data - gives us three possible scenarios:
<itemizedlist>
<listitem>
<para>
Data arrives into TBF at a rate equal to the rate of incoming tokens. In this case each packet has its matching token and passes the queue without further delay.
</para>
</listitem>
<listitem>
<para>
Data arrives into TBF at a rate smaller than the token rate. Only some tokens are deleted from the bucket - one as each packet leaves - so tokens accumulate in the bucket, up to the bucket size. The saved tokens can then be used to send data at a rate higher than the token rate, to compensate for short bursts.
</para>
</listitem>
<listitem>
<para>
Data arrives at a rate higher than the token rate. In this case a filter overrun occurs - incoming data can only be sent out without loss until all accumulated tokens are used up. After that, overlimit packets are dropped.
</para>
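<para>
A typical TBF invocation caps a device at a fixed rate (the rate, latency and burst values below are just examples, not recommendations):
</para>
<screen>
tc qdisc add dev eth0 root tbf rate 220kbit latency 50ms burst 1540
</screen>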
</listitem>
</itemizedlist>
</para>
</sect2>
<sect2 id="tc-qdisc-cbq">
<title>Class Based Queue (CBQ)</title>
<para>
This queuing discipline classifies the waiting packets into a tree-like hierarchy of classes. The leaves of this tree are in turn scheduled by separate queuing disciplines.
</para>
<para>
CBQ is a very commonly used scheduler. It often serves as the basis of a qdisc hierarchy, with other queuing disciplines attached to its classes.
</para>
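<para>
A sketch of a CBQ setup (all bandwidth figures and the destination address are example values): a root CBQ on a 100Mbit device, one bounded 6Mbit class, and a u32 filter directing traffic for one host into that class:
</para>
<screen>
tc qdisc add dev eth0 root handle 1: cbq bandwidth 100Mbit avpkt 1000
tc class add dev eth0 parent 1: classid 1:1 cbq bandwidth 100Mbit \
    rate 6Mbit weight 0.6Mbit allot 1514 prio 5 avpkt 1000 bounded
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
    match ip dst 192.168.1.3 flowid 1:1
</screen>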
</sect2>
<sect2 id="tc-qdisc-sfq">
<title>Stochastic Fairness Queuing (SFQ)</title>
<para>
SFQ is not quite deterministic, but works (on average). Its main benefits are that it requires little CPU and memory.
</para>
<para>
SFQ consists of a dynamically allocated number of FIFO queues, one for each conversation. A conversation (or flow) is distinguished by its source/destination IP addresses and port numbers. The discipline runs in round-robin, sending one packet from each FIFO per turn, which is why it is called fair. The main advantage of SFQ is that it allows fair sharing of the link between different applications, preventing a bandwidth takeover by one client or one application.
</para>
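<para>
SFQ needs next to no tuning; the perturb parameter reconfigures the hash every given number of seconds, so that no flow stays permanently unlucky (device name and interval are example values):
</para>
<screen>
tc qdisc add dev eth0 root sfq perturb 10
</screen>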
</sect2>
<sect2 id="tc-qdisc-pfifo">
<title>pfifo_fast</title>
<para>
The queue is, as the name says, first in, first out. That means that no packet receives any special treatment. At least, not quite. This qdisc has three so-called 'bands'. Within each band, FIFO rules apply. However, if there are packets waiting in band 0, band 1 won't be processed. Same goes for band 1 and band 2.
</para>
</sect2>
<sect2 id="tc-qdisc-red">
<title>Random Early Detect (RED)</title>
<para>
RED is only useful for TCP traffic, as it exploits TCP's congestion control (slow start). Once the link is filling up, it starts dropping packets. This signals to the TCP stack on the sending machine that the link is congested, and the sender slows down. The key point is that it simulates real congestion before the queue actually overflows.
</para>
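<para>
A RED configuration requires several parameters; the values below are only an illustration for a 10Mbit link, not tuned recommendations:
</para>
<screen>
tc qdisc add dev eth0 root red limit 400000 min 30000 max 90000 \
    avpkt 1000 burst 55 probability 0.02 bandwidth 10Mbit
</screen>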
</sect2>
<sect2 id="tc-qdisc-ingres">
<title>Ingress policer</title>
<para>
The ingress policer implements a hard limit. You configure it to a specific rate, and all packets entering this queue exceeding the configured rate are dropped.
</para>
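<para>
A sketch of an ingress policer that drops everything arriving on eth0 above 1Mbit (the rate and burst are example values):
</para>
<screen>
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 \
    match u32 0 0 police rate 1mbit burst 10k drop flowid :1
</screen>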
</sect2>
</sect1>
</chapter>
<appendix id="further-reading">
<title>Further Reading</title>
</appendix>
<appendix id="acknowledgements">
<title>Acknowledgements</title>
<para>
Although I wrote this document, I wasn't involved in any of the iproute2 / tc development. I am still baffled by the abstract, flexible concept it provides. My thanks go out to the iproute2+tc developers, especially Alexey Kuznetsov (our Linux networking god) and Werner Almesberger. Thanks to Rusty Russell, who inspired me at OLS2000 and LBW2000 to get more deeply involved with netfilter. I want to thank Andi Kleen and Marc Boucher for some really nice discussions at our meetings in Munich. Not to forget Bert Hubert and his team for writing the Linux 2.4 Advanced Routing HOWTO. Additional special thanks to the people who invented DocBook.
</para>
</appendix>
<glossary>
<title>Glossary</title>
<glossentry id="gloss-diffserv">
<glossterm>Differentiated Services</glossterm>
<acronym>DiffServ</acronym>
<glossdef><para>
DiffServ is one of the two current <glossterm linkend="gloss-qos">QoS</glossterm> implementations (the other one being Integrated Services); it is based on a value carried by packets in the DS field of the IP header.
</para></glossdef>
</glossentry>
<glossentry id="gloss-ipchains">
<glossterm>ipchains</glossterm>
<glossdef><para>
The packet filtering system in Linux 2.2
</para></glossdef>
</glossentry>
<glossentry id="gloss-iptables">
<glossterm>iptables</glossterm>
<glossdef><para>
The packet filtering system in Linux 2.4, based on <glossterm linkend="gloss-netfilter">netfilter</glossterm>.
</para></glossdef>
</glossentry>
<glossentry id="gloss-netfilter">
<glossterm>netfilter</glossterm>
<glossdef><para>
Common term used for the Linux 2.4 firewalling subsystem. To be more precise, it is the infrastructure underlying packet filtering, NAT and packet mangling.
</para></glossdef>
</glossentry>
<glossentry id="gloss-netlink">
<glossterm>Netlink Socket</glossterm>
<glossdef><para>
A special socket between kernel and userspace. Used by iproute2 to alter information in the routing tables, arp cache, policy routing database, ...
</para></glossdef>
</glossentry>
<glossentry id="gloss-ospf">
<glossterm>Open Shortest Path First</glossterm>
<acronym>OSPF</acronym>
<glossdef><para>
A dynamic routing protocol.
</para></glossdef>
</glossentry>
<glossentry id="gloss-qos">
<glossterm>Quality of Service</glossterm>
<acronym>QoS</acronym>
<glossdef><para>
Guaranteeing a certain bandwidth for specific applications
</para></glossdef>
</glossentry>
<glossentry id="gloss-rip">
<glossterm>Routing Information Protocol</glossterm>
<acronym>RIP</acronym>
<glossdef><para>
A dynamic routing protocol.
</para></glossdef>
</glossentry>
</glossary>
</book>