%include "default.mgp"
%default 1 bgrad
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%nodefault
%back "blue"
%center
%size 7
Hardware Selection
and Kernel Tuning
for High Performance Networking
Dec 07, 2006
SLAC, Berlin
%center
%size 4
by
Harald Welte <laforge@gnumonks.org>
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
About the Speaker
Who is speaking to you?
an independent Free Software developer
Linux kernel related consulting + development for 10 years
one of the authors of Linux kernel packet filter
busy with enforcing the GPL at gpl-violations.org
working on Free Software for smartphones (openezx.org)
...and Free Software for RFID (librfid)
...and Free Software for ePassports (libmrtd)
...and Free Hardware for RFID (openpcd.org, openbeacon.org)
...and the world's first Open GSM Phone (openmoko.com)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Hardware selection is important
Hardware selection is important
Linux runs on almost anything, from a cellphone to a mainframe
good system performance depends on optimum selection of components
sysadmins and managers have to understand the importance of hardware choice
determine hardware needs before making a purchase!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Network usage patterns
Network usage patterns
TCP server workload (web server, ftp server, samba, nfs-tcp)
high-bandwidth TCP end-host performance
UDP server workload (nfs udp)
don't use it at gigabit speeds: data integrity problems!
Router (Packet filter / IPsec / ... ) workload
packet forwarding has fundamentally different requirements
none of the offloading tricks works in this case
important limit: pps, not bandwidth!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Contemporary PC hardware
Contemporary PC hardware
CPU often is extremely fast
2GHz CPU: 0.5ns clock cycle
L1/L2 cache access (four bytes): 2..3 clock cycles
everything that is not in L1 or L2 cache is like a disk access
40..180 clock cycles on Opteron (DDR-333)
250..460 clock cycles on Xeon (DDR-333)
I/O read
easily up to 3600 clock cycles for a register read on NIC
this happens synchronously, no other work can be executed!
disk access
don't talk about it. Like getting a coke from the moon.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Hardware selection
Hardware selection
CPU
cache
as much cache as possible
shared cache (in multi-core setup) is great
SMP or not
problem: increased code complexity
problem: cache line ping-pong (on real SMP)
depends on workload
depends on number of interfaces!
Pro: IPsec, tc, complex routing
Con: NAT-only box
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Hardware selection
Hardware selection
RAM
as fast as possible
use chipsets with highest possible speed
amd64 (Opteron, ..)
has per-cpu memory controller
doesn't waste system bus bandwidth for RAM access
Intel
has a traditional 'shared system bus' architecture
RAM is system-wide and not per-CPU
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Hardware selection
Hardware selection
Bus architecture
as few bridges as possible
host bridge, PCI-X / PCIe bridge + NIC chipset is enough!
check bus speeds
real interrupts (PCI, PCI-X) have lower latency than message-signalled interrupts (MSI)
some boards use PCIe chipset and then additional PCIe-to-PCI-X bridge :(
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Hardware selection
Hardware selection
NIC selection
NIC hardware
avoid additional bridges (four-port cards)
PCI-X: 64bit, highest clock rate, if possible (133MHz)
NIC driver support
many optional features
checksum offload
scatter gather DMA
segmentation offload (TSO/GSO)
interrupt flood behaviour (NAPI)
is the vendor supportive of the developers?
Intel: e100/e1000 docs public!
is the vendor merging his patches mainline?
SysKonnect (bad) vs. Intel (good)
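a minimal C sketch (not from the original slides) that queries a few of these offloads via the per-feature ethtool ioctls; "eth0" is an assumed interface name:
    /* query one boolean offload feature from the driver */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    static int ethtool_get(int fd, const char *dev, unsigned int cmd)
    {
        struct ethtool_value ev = { .cmd = cmd };
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ev;
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
            return -1;              /* feature unknown / not supported */
        return ev.data;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        printf("rx csum offload: %d\n", ethtool_get(fd, "eth0", ETHTOOL_GRXCSUM));
        printf("scatter-gather:  %d\n", ethtool_get(fd, "eth0", ETHTOOL_GSG));
        printf("TSO:             %d\n", ethtool_get(fd, "eth0", ETHTOOL_GTSO));
        return 0;
    }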
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Hardware selection
Hardware selection
hard disk
the kernel network stack is always 100% resident in RAM
therefore, disk performance is not important for the network stack
however, one hint:
for SMTP servers, use battery-buffered RAM disks (Gigabyte)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Network Stack Tuning
Network Stack Tuning
hardware related
prevent multiple NICs from sharing one irq line
can be checked in /proc/interrupts
highly dependent on specific mainboard/chipset
configure irq affinity
in an SMP system, interrupts can be bound to one CPU
irq affinity should be set to ensure all packets from one interface are handled on the same CPU (cache locality), as sketched below
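a minimal C sketch (not from the original slides; "eth0" and the CPU0 mask are assumptions):
    /* find eth0's IRQ in /proc/interrupts and pin it to CPU0 by
     * writing a hex CPU bitmask to /proc/irq/<N>/smp_affinity */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        FILE *in = fopen("/proc/interrupts", "r");
        char line[512], path[64];
        int irq = -1;

        if (!in) { perror("/proc/interrupts"); return 1; }
        while (fgets(line, sizeof(line), in))
            if (strstr(line, "eth0"))
                irq = atoi(line);   /* each line starts with "  <irq>:" */
        fclose(in);
        if (irq < 0) { fprintf(stderr, "eth0 not found\n"); return 1; }

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        FILE *out = fopen(path, "w");
        if (!out) { perror(path); return 1; }
        fputs("1\n", out);          /* hex bitmask: CPU0 only */
        fclose(out);
        return 0;
    }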
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Network Stack Tuning
Network Stack Tuning
32bit or 64bit kernel?
most contemporary x86 systems support x86_64
biggest advantage: larger address space for kernel memory
however, problem: all pointers are now 8 bytes instead of 4
thus, larger in-kernel data structures
thus, decreased cache efficiency
in packet forwarding applications, ca. 10% less performance
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Network Stack Tuning
Network Stack Tuning
firewall specific
organize the ruleset as a tree rather than a linear list
conntrack: tune hashsize / ip_conntrack_max (see the sketch below)
logging: don't use syslog, use ulogd-1.x or 2.x instead
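a minimal C sketch (assumes a 2.6-era kernel exposing ip_conntrack_max via procfs; the value is only an example):
    /* raise the conntrack table limit; newer kernels expose it as
     * /proc/sys/net/netfilter/nf_conntrack_max, and the hash size is
     * typically an ip_conntrack module parameter */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/net/ipv4/netfilter/ip_conntrack_max", "w");

        if (!f) { perror("ip_conntrack_max"); return 1; }
        fprintf(f, "%d\n", 262144); /* example value; size it to available RAM */
        fclose(f);
        return 0;
    }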
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Network Stack Tuning
Network Stack Tuning
local sockets
SO_SNDBUF / SO_RCVBUF should be set by apps (see the sketch below)
in recent 2.6.x kernels, they can override /proc/sys/net/ipv4/tcp_[rw]mem
on long fat pipes, increase /proc/sys/net/ipv4/tcp_adv_win_scale
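a minimal C sketch of an application requesting larger socket buffers (assumption: TCP client, 256 KiB chosen arbitrarily):
    /* per-socket buffer sizes, still capped by net.core.rmem_max / wmem_max */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int bufsize = 256 * 1024;

        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_RCVBUF");
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_SNDBUF");

        /* ... connect() / bind() and normal socket usage would follow ... */
        return 0;
    }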
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Network Stack Tuning
Network Stack Tuning
core network stack
disable rp_filter; it adds extra per-packet routing lookups (see the sketch below)
check linux-x.y.z/Documentation/networking/ip-sysctl.txt for more information
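a minimal C sketch (not from the original slides) that clears rp_filter on all interfaces:
    /* write 0 to every /proc/sys/net/ipv4/conf/<if>/rp_filter entry */
    #include <stdio.h>
    #include <dirent.h>

    int main(void)
    {
        const char *base = "/proc/sys/net/ipv4/conf";
        char path[256];
        struct dirent *e;
        DIR *d = opendir(base);

        if (!d) { perror(base); return 1; }
        while ((e = readdir(d)) != NULL) {
            if (e->d_name[0] == '.')
                continue;
            snprintf(path, sizeof(path), "%s/%s/rp_filter", base, e->d_name);
            FILE *f = fopen(path, "w");
            if (f) { fputs("0\n", f); fclose(f); }
        }
        closedir(d);
        return 0;
    }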
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Network Performance & Tuning
Links
Links
The Linux Advanced Routing and Traffic Control HOWTO
http://www.lartc.org/
The netdev mailinglist
netdev@vger.kernel.org